meituan-longcat/AMO-Bench
Description: This is the official repo for the paper "AMO-Bench: Large Language Models Still Struggle in High School Math Competitions".
Language: Python
License: MIT
Stars: 128
Forks: 3
Open issues: 1
Created: 2025-10-29T09:25:17Z
Pushed: 2026-02-06T02:59:58Z
Default branch: main
Fork: no
Archived: no
README:
📃 Paper • 🌐 Project Page • 🤗 Dataset
This is the official repo for the paper AMO-Bench: Large Language Models Still Struggle in High School Math Competitions.
Updates
- 2026.02.05: Leaderboard Update: [Qwen3-Max-Thinking](https://qwen.ai/blog?id=qwen3-max-thinking) achieves a new SOTA with 65.1%, while [GLM-4.7](https://z.ai/blog/glm-4.7) sets a new open-source record at 62.4%!
- 2025.12.01: We have added a [Token Efficiency](#-token-efficiency) section showing the number of output tokens used by the models on the leaderboard. [Gemini 3 Pro](https://deepmind.google/models/gemini/pro/) achieves the highest token efficiency among top-performing models!
- 2025.11.24: [Gemini 3 Pro](https://deepmind.google/models/gemini/pro/) achieves 63.1%, setting a new SOTA and breaking 60% for the first time! We have updated the [Leaderboard](#-leaderboard) with the results of Gemini 3 Pro and Qwen3-Max-Thinking (Preview).
- 2025.11.19: Kimi-K2-Thinking achieves 56.0%, new SOTA on [Leaderboard](#-leaderboard)!
- 2025.11.05: The problem statement of Problem 35 has been revised in the 🤗 Hugging Face Dataset: (1) the five integers that sum to $k$ should be non-negative rather than positive, and (2) we also stipulate that 1 cannot be replaced with five integers. Additionally, for the strictly positive case in the original problem statement, the correct answer should be 7656 (see this discussion for details). Thanks to @applesilicon for the feedback!
- 2025.10.31: We release the dataset, evaluation code, and technical report of AMO-Bench.
📊 Leaderboard
📈 Token Efficiency
📖 Abstract
We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation.
⭐ Key Features
- Original problems. To prevent performance leaks from existing resources as much as possible, all problems in AMO-Bench are newly crafted by human experts. Moreover, we conduct a secondary verification to ensure that there are no highly similar problems in existing competitions or online resources.
- Guaranteed difficulty. Each problem has undergone rigorous cross-validation by multiple experts to ensure it meets at least the difficulty standards of IMO. We also incorporate an LLM-based difficulty filtering stage to exclude questions that do not present sufficient challenge to current reasoning models.
- Final-answer based grading. Each problem in AMO-Bench requires a final answer rather than a full proof, enabling efficient automatic grading. For each problem, we employ a parser-based or LLM-based grading method according to its answer type, balancing the grading cost and generalizability.
- Human-annotated reasoning paths. In addition to the final answer, each problem also includes a detailed reasoning path written by human experts. These additional annotations enhance solution transparency and could support further explorations on AMO-Bench, such as prompt engineering and error analysis.
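The parser-based side of final-answer grading can be illustrated with a small sketch. This is not the repository's grader: it assumes, purely for illustration, that a response states its final answer inside `\boxed{...}` and that the answer is either a rational number or a short expression. The names `extract_last_boxed` and `answers_match` are hypothetical.

```python
from fractions import Fraction


def extract_last_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward from the opening brace, tracking nesting so that
    # answers such as \frac{1}{2} are captured whole.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced


def answers_match(predicted, reference):
    """Compare two answer strings, numerically when both parse as rationals."""
    if predicted is None:
        return False
    try:
        return Fraction(predicted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        # Assumption: fall back to a whitespace-insensitive string comparison.
        return predicted.replace(" ", "") == reference.replace(" ", "")
```

A string comparison like the fallback above is brittle for anything beyond simple answers, which is one reason a benchmark might route harder answer types to an LLM-based grader instead.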
🛠️ Quick Start
Installation
1. Clone the repository:
git clone https://github.com/meituan-longcat/AMO-Bench.git
cd AMO-Bench
2. Install dependencies:
pip install -r requirements.txt
Running evaluations
Step 1: Format Model Response File
After obtaining model responses, format them as follows (one JSON object per line):
{"question_id": 1, "model_response": "..."}
{"question_id": 2, "model_response": "..."}
...
Save this file in the ./model_responses/ directory.
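A minimal sketch of producing and sanity-checking such a file is shown here. It relies only on the two field names given above; the helper names `write_responses` and `check_responses`, and the idea of checking for duplicate IDs, are assumptions for illustration and not part of the repository.

```python
import json
from pathlib import Path


def write_responses(responses, path):
    """Write a {question_id: response_text} mapping as one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for question_id in sorted(responses):
            record = {
                "question_id": question_id,
                "model_response": responses[question_id],
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def check_responses(path):
    """Return a list of human-readable problems found in a response file."""
    problems = []
    seen = set()
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            missing = {"question_id", "model_response"} - record.keys()
            if missing:
                problems.append(f"line {line_no}: missing {sorted(missing)}")
                continue
            if record["question_id"] in seen:
                problems.append(
                    f"line {line_no}: duplicate question_id {record['question_id']}"
                )
            seen.add(record["question_id"])
    return problems
```

For example, `write_responses({1: "...", 2: "..."}, "model_responses/my_model.jsonl")` writes two lines in the format above, and `check_responses` on that path returns an empty list.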
Step 2: Grading Responses
Set your API key and URL in lines 13-14 of utils.py. Then run:
python grading.py --response_file example.jsonl
Evaluation results will be saved under the ./grading_results/ directory.
Step 3 (Optional): Grade on AMO-Bench-P Subset
For a quick evaluation using only the parser-based subset (39 problems), run:
python grading.py --response_file example.jsonl --only_parser True
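When grading several response files, the two invocations above can be scripted. The sketch below only assembles the command lines already shown (`--response_file` and `--only_parser True`); the wrapper names `build_grading_command` and `run_grading`, and the `repo_dir` parameter, are hypothetical, and it assumes the script is run from a checkout of the repository.

```python
import subprocess
import sys


def build_grading_command(response_file, only_parser=False):
    """Assemble the grading.py command line documented in the steps above."""
    cmd = [sys.executable, "grading.py", "--response_file", response_file]
    if only_parser:
        # Restrict grading to the parser-based subset (AMO-Bench-P).
        cmd += ["--only_parser", "True"]
    return cmd


def run_grading(response_file, only_parser=False, repo_dir="."):
    """Run grading.py inside repo_dir; raises CalledProcessError on failure."""
    return subprocess.run(
        build_grading_command(response_file, only_parser),
        cwd=repo_dir,
        check=True,
    )
```

Calling `run_grading("example.jsonl", only_parser=True)` from the repository root would then be equivalent to the Step 3 command.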
Discussions and Feedback
Here we summarize the discussions and feedback on AMO-Bench from the open-source community. We will regularly update the dataset to address urgent data issues.
We welcome any feedback you may have!
- Problem 26 appears to be effectively the same as an existing contest problem. Thanks to @applesilicon for pointing this out!
- The problem statement for Problem 35 should be further clarified: (1) the five integers that sum to $k$ should be non-negative rather than positive, and (2) we also stipulate that 1 cannot be replaced with five integers. Additionally, for the strictly positive case in the original problem statement, the correct answer should be 7656 (see this discussion for details). Thanks to @applesilicon for the suggestions!
- Four problems involve complex numerical expressions (Problems 12, 13, 15, and 21). When tackling these problems, LLMs may struggle to…