Repo · Meituan (LongCat) · published Oct 29, 2025

meituan-longcat/AMO-Bench

Python

Description: This is the official repo for the paper "AMO-Bench: Large Language Models Still Struggle in High School Math Competitions".

Language: Python

License: MIT

Stars: 128

Forks: 3

Open issues: 1

Created: 2025-10-29T09:25:17Z

Pushed: 2026-02-06T02:59:58Z

Default branch: main

Fork: no

Archived: no

README:

📃 Paper • 🌐 Project Page • 🤗 Dataset

This is the official repo for the paper AMO-Bench: Large Language Models Still Struggle in High School Math Competitions.

Updates

  • 2026.02.05: Leaderboard Update: [Qwen3-Max-Thinking](https://qwen.ai/blog?id=qwen3-max-thinking) achieves a new SOTA with 65.1%, while [GLM-4.7](https://z.ai/blog/glm-4.7) sets a new open-source record at 62.4%!
  • 2025.12.01: We have added a [Token Efficiency](#-token-efficiency) section showing the number of output tokens used by models on the leaderboard. [Gemini 3 Pro](https://deepmind.google/models/gemini/pro/) achieves the highest token efficiency among top-performing models!
  • 2025.11.24: [Gemini 3 Pro](https://deepmind.google/models/gemini/pro/) achieves 63.1%, setting a new SOTA and breaking 60% for the first time! We have updated the [Leaderboard](#-leaderboard) with the results of Gemini 3 Pro and Qwen3-Max-Thinking (Preview).
  • 2025.11.19: Kimi-K2-Thinking achieves 56.0%, new SOTA on [Leaderboard](#-leaderboard)!
  • 2025.11.05: The problem statement of Problem 35 has been revised in the 🤗 Hugging Face dataset: (1) the five integers that sum to $k$ should be non-negative rather than positive, and (2) we also stipulate that 1 cannot be replaced with five integers. Additionally, for the strictly positive case in the original problem statement, the correct answer should be 7656 (see this discussion for details). Thanks to @applesilicon for the feedback!
  • 2025.10.31: We release the dataset, evaluation code, and technical report of AMO-Bench.

📊 Leaderboard

📈 Token Efficiency

📖 Abstract

We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation.

⭐ Key Features

  • Original problems. To prevent performance leaks from existing resources as much as possible, all problems in AMO-Bench are newly crafted by human experts. Moreover, we conduct a secondary verification to ensure that there are no highly similar problems in existing competitions or online resources.

  • Guaranteed difficulty. Each problem has undergone rigorous cross-validation by multiple experts to ensure it meets at least the difficulty standards of IMO. We also incorporate an LLM-based difficulty filtering stage to exclude questions that do not present sufficient challenge to current reasoning models.

  • Final-answer based grading. Each problem in AMO-Bench requires a final answer rather than a full proof, enabling efficient automatic grading. For each problem, we employ a parser-based or LLM-based grading method according to its answer type, balancing the grading cost and generalizability.

  • Human-annotated reasoning paths. In addition to the final answer, each problem also includes a detailed reasoning path written by human experts. These additional annotations enhance solution transparency and could support further explorations on AMO-Bench, such as prompt engineering and error analysis.
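To make the idea of parser-based final-answer grading concrete, here is a minimal sketch. It is not the repository's grader: it assumes the model wraps its final answer in `\boxed{...}` and that the reference answer is a plain number. The function names `extract_boxed`, `parse_number`, and `parser_grade` are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def extract_boxed(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, or None."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking brace depth so nested braces stay intact.
    for ch in response[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def parse_number(text: str) -> Optional[Fraction]:
    """Parse an integer, decimal, or a/b fraction exactly; None if not numeric."""
    cleaned = text.replace(",", "").replace(" ", "")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def parser_grade(response: str, reference: str) -> bool:
    """Assumed rule: compare numerically when possible, else by exact string."""
    predicted = extract_boxed(response)
    if predicted is None:
        return False
    p, r = parse_number(predicted), parse_number(reference)
    if p is not None and r is not None:
        return p == r
    return predicted == reference.strip()
```

Answers that cannot be normalized this way (for example, symbolic expressions with several equivalent forms) are the kind of case where an LLM-based grader is the more general, if costlier, option.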

🛠️ Quick Start

Installation

1. Clone the repository:

git clone https://github.com/meituan-longcat/AMO-Bench.git
cd AMO-Bench

2. Install dependencies:

pip install -r requirements.txt

Running evaluations

Step 1: Format Model Response File

After obtaining model responses, format them as follows (one JSON object per line):

{"question_id": 1, "model_response": "..."}
{"question_id": 2, "model_response": "..."}
...

Save this file in the ./model_responses/ directory.
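As an illustration of Step 1, the following sketch writes responses in the JSON Lines layout shown above and reads them back with basic checks. The helper names `write_responses` and `load_responses` are hypothetical and not part of the repository; only the two keys `question_id` and `model_response` come from the format above.

```python
import json
from pathlib import Path
from typing import Dict


def write_responses(responses: Dict[int, str], path: Path) -> None:
    """Write one JSON object per line, ordered by question_id."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for question_id in sorted(responses):
            record = {
                "question_id": question_id,
                "model_response": responses[question_id],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_responses(path: Path) -> Dict[int, str]:
    """Read the file back, rejecting missing keys and duplicate ids."""
    loaded: Dict[int, str] = {}
    with path.open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"question_id", "model_response"} - record.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing {sorted(missing)}")
            if record["question_id"] in loaded:
                raise ValueError(f"line {line_number}: duplicate question_id")
            loaded[record["question_id"]] = record["model_response"]
    return loaded
```

For example, `write_responses({1: "...", 2: "..."}, Path("model_responses/example.jsonl"))` would produce a file in the expected directory, and `load_responses` can serve as a quick sanity check before grading.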

Step 2: Grading Responses

Set your API key and URL in lines 13-14 of utils.py. Then run:

python grading.py --response_file example.jsonl

Evaluation results will be saved under the ./grading_results/ directory.

Step 3 (Optional): Grade on AMO-Bench-P Subset

For a quick evaluation using only the parser-based subset (39 problems), run:

python grading.py --response_file example.jsonl --only_parser True
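When grading several response files, the two commands above can be scripted. This is a minimal sketch that only assembles the documented flags (`--response_file` and `--only_parser True`); the wrapper functions `build_grading_command` and `run_grading` are hypothetical, and it assumes the script is launched from the repository root with the API key and URL already set in utils.py.

```python
import subprocess
import sys
from typing import List


def build_grading_command(response_file: str, only_parser: bool = False) -> List[str]:
    """Assemble the grading.py invocation using the flags documented above."""
    command = [sys.executable, "grading.py", "--response_file", response_file]
    if only_parser:
        # Restrict grading to the parser-based subset (AMO-Bench-P).
        command += ["--only_parser", "True"]
    return command


def run_grading(response_file: str, only_parser: bool = False) -> None:
    """Run grading.py and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_grading_command(response_file, only_parser), check=True)
```

Calling `run_grading("example.jsonl", only_parser=True)` corresponds to the Step 3 command; using `sys.executable` instead of a bare `python` is a design choice here so the same interpreter (and installed dependencies) is reused.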

Discussions and Feedback

Here we summarize the discussions and feedback on AMO-Bench from the open-source community. We will regularly update the dataset to address urgent data issues.

We welcome any feedback you may have!

  • Problem 26 appears to be effectively the same as an existing contest problem. Thanks to @applesilicon for pointing this out!
  • The problem statement for Problem 35 should be further clarified: (1) the five integers that sum to $k$ should be non-negative rather than positive, and (2) we also stipulate that 1 cannot be replaced with five integers. Additionally, for the strictly positive case in the original problem statement, the correct answer should be 7656 (see this discussion for details). Thanks to @applesilicon for the suggestions!
  • Four problems involve complex numerical expressions (Problems 12, 13, 15, and 21). When tackling these problems, LLMs may struggle to…
