openai/mle-bench
Description: MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
Language: Python
License: NOASSERTION
Stars: 1573
Forks: 253
Open issues: 8
Created: 2024-10-08T17:07:40Z
Pushed: 2026-04-24T17:33:44Z
Default branch: main
Fork: no
Archived: no
README:
MLE-bench
Code for the paper "MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering". We have released the code used to construct the dataset, the evaluation logic, as well as the agents we evaluated for this benchmark.
Leaderboard
*Update* (04-24-2026): We are currently not taking any new submissions to the leaderboard while we develop an improved process for ensuring submissions are fair and comparable. We will share updates on this process in the future.
| Agent | LLM(s) used | Low == Lite (%) | Medium (%) | High (%) | All (%) | Running Time (hours) | Date | Source Code Available | Grading Reports Available |
|-------|-------------|-----------------|------------|----------|---------|----------------------|------|----------------------|---------------------------|
| Famou-Agent 2.0 | Gemini-3-Pro-Preview | 80.3 ± 1.52 | 64.04 ± 2.32 | 42.22 ± 2.22 | 64.44 ± 1.18 | 24 | 2026-02-23 | X | ✓ |
| AIBuildAI | Claude-Opus-4.6 | 77.27 ± 0.00 | 61.40 ± 0.88 | 46.67 ± 0.00 | 63.11 ± 0.44 | 24 | 2026-03-06 | X | ✓ |
| CAIR MARS+ | Gemini-3-Pro-Preview | 78.79 ± 1.52 | 60.53 ± 1.52 | 44.44 ± 2.22 | 62.67 ± 0.77 | 24 | 2026-02-17 | X | ✓ |
| MLEvolve | Gemini-3-Pro-Preview | 80.30 ± 1.52 | 57.89 ± 1.52 | 42.22 ± 2.22 | 61.33 ± 1.33 | 12 | 2026-02-14 | ✓ | ✓ |
| PiEvolve (Fractal AI Research) | Gemini-3-Pro-Preview[^4] | 80.30 ± 1.52[^3] | 58.77 ± 0.88[^3] | 40.0 ± 0.00[^3] | 61.33 ± 0.77[^3] | 24 | 2026-01-05 | X | ✓ |
| Famou-Agent 2.0 | Gemini-2.5-Pro | 75.76 ± 1.52 | 57.89 ± 1.52 | 40.00 ± 0.00 | 59.56 ± 0.89 | 24 | 2025-12-27 | X | ✓ |
| ML-Master 2.0 | Deepseek-V3.2-Speciale | 75.76 ± 1.51 | 50.88 ± 3.51 | 42.22 ± 2.22 | 56.44 ± 2.47 | 24 | 2025-12-16 | X | ✓ |
| CAIR MARS | Gemini-3-Pro-Preview | 74.24 ± 1.52 | 52.63 ± 3.04 | 37.78 ± 2.22 | 56.0 ± 1.54 | 24 | 2026-01-25 | X | ✓ |
| PiEvolve (Fractal AI Research) | Gemini-3-Pro-Preview[^4] | 74.24 ± 3.03[^3] | 45.61 ± 0.88[^3] | 35.55 ± 2.22[^3] | 52.0 ± 0.77[^3] | 12 | 2026-01-05 | X | ✓ |
| Leeroo | Gemini-3-Pro-Preview[^4] | 68.18 ± 2.62[^3] | 44.74 ± 1.52[^3] | 40.00 ± 0.00[^3] | 50.67 ± 1.33[^3] | 24 | 2025-12-07 | ✓ | ✓ |
| Thesis | gpt-5-codex | 65.15 ± 1.52 | 45.61 ± 7.18 | 31.11 ± 2.22 | 48.44 ± 3.64 | 24 | 2025-11-10 | X | ✓ |
| CAIR MLE-STAR-Pro-1.5 | Gemini-2.5-Pro | 68.18 ± 2.62 | 34.21 ± 1.52 | 33.33 ± 0.00 | 44.00 ± 1.33 | 24 | 2025-11-25 | X | ✓ |
| Famou-Agent | Gemini-2.5-Pro | 62.12 ± 1.52 | 36.84 ± 1.52 | 33.33 ± 0.00 | 43.56 ± 0.89 | 24 | 2025-10-10 | X | ✓ |
| Operand ensemble | gpt-5 (low verbosity/effort)[^2] | 63.64 ± 0.00 | 33.33 ± 0.88[^3] | 20.00 ± 0.00[^3] | 39.56 ± 0.44[^3] | 24 | 2025-10-06 | X | ✓ |
| CAIR MLE-STAR-Pro-1.0 | Gemini-2.5-Pro | 66.67 ± 1.52 | 25.44 ± 0.88 | 31.11 ± 2.22 | 38.67 ± 0.77 | 12 | 2025-11-03 | X | ✓ |
| InternAgent | deepseek-r1 | 62.12 ± 3.03 | 26.32 ± 2.63 | 24.44 ± 2.22 | 36.44 ± 1.18 | 12 | 2025-09-12 | X | ✓ |
| R&D-Agent | gpt-5 | 68.18 ± 2.62 | 21.05 ± 1.52 | 22.22 ± 2.22 | 35.11 ± 0.44 | 12 | 2025-09-26 | ✓ | ✓ |
| Neo multi-agent | undisclosed | 48.48 ± 1.52 | 29.82 ± 2.32 | 24.44 ± 2.22 | 34.22 ± 0.89 | 36 | 2025-07-28 | X | ✓ |
| AIRA-dojo | o3 | 55.00 ± 1.47 | 21.97 ± 1.17 | 21.67 ± 1.07 | 31.60 ± 0.82 | 24 | 2025-05-15 | ✓ | ✓ |
| R&D-Agent | o3 + GPT-4.1 | 51.52 ± 4.01 | 19.30 ± 3.16 | 26.67 ± 0.00 | 30.22 ± 0.89 | 24 | 2025-08-15 | ✓ | ✓ |
| ML-Master | deepseek-r1 | 48.48 ± 1.52 | 20.18 ± 2.32 | 24.44 ± 2.22 | 29.33 ± 0.77 | 12 | 2025-06-17 | ✓ | ✓ |
| R&D-Agent | o1-preview | 48.18 ± 1.11 | 8.95 ± 1.05 | 18.67 ± 1.33 | 22.40 ± 0.50 | 24 | 2025-05-14 | ✓ | ✓ |
| AIDE | o1-preview | 35.91 ± 1.86 | 8.45 ± 0.43 | 11.67 ± 1.27 | 17.12 ± 0.61 | 24 | 2024-10-08 | ✓ | ✓ |
| AIDE | gpt-4o-2024-08-06 | 18.55 ± 1.26 | 3.06 ± 0.33 | 8.15 ± 0.84 | 8.63 ± 0.54 | 24 | 2024-10-08 | ✓ | ✓ |
| AIDE | claude-3-5-sonnet-20240620 | 19.70 ± 1.52 | 2.63 ± 1.52 | 2.22 ± 2.22 | 7.56 ± 1.60 | 24 | 2024-10-08 | ✓ | ✓ |
| OpenHands | gpt-4o-2024-08-06 | 12.12 ± 1.52 | 1.75 ± 0.88 | 2.22 ± 2.22 | 4.89 ± 0.44 | 24 | 2024-10-08 | ✓ | ✓ |
| AIDE | llama-3.1-405b-instruct | 10.23 ± 1.14 | 0.66 ± 0.66 | 0.00 ± 0.00 | 3.33 ± 0.38 | 24 | 2024-10-08 | ✓ | ✓ |
| MLAB | gpt-4o-2024-08-06 | 4.55 ± 0.86 | 0.00 ± 0.00 | 0.00 ± 0.00 | 1.60 ± 0.27 | 24 | 2024-10-08 | ✓ | ✓ |
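The "All (%)" column appears to be a size-weighted average of the Low, Medium, and High columns. The sketch below illustrates that relationship for two leaderboard rows. The split sizes (22 low, 38 medium, 15 high competitions) are an assumption that is not stated in this excerpt, and `overall_rate` is a hypothetical helper, not part of the mle-bench codebase. Small differences from the published figures are expected, because the table values are rounded.

```python
# Assumed number of competitions per complexity split (not given in this excerpt).
SPLIT_SIZES = {"low": 22, "medium": 38, "high": 15}


def overall_rate(low: float, medium: float, high: float,
                 sizes: dict = SPLIT_SIZES) -> float:
    """Combine per-split percentages into one size-weighted percentage."""
    total = sum(sizes.values())
    weighted = (low * sizes["low"]
                + medium * sizes["medium"]
                + high * sizes["high"])
    return weighted / total


# (low, medium, high, reported "All") taken from two rows of the leaderboard.
rows = {
    "Famou-Agent 2.0 (Gemini-3-Pro-Preview)": (80.3, 64.04, 42.22, 64.44),
    "AIBuildAI (Claude-Opus-4.6)": (77.27, 61.40, 46.67, 63.11),
}

for agent, (low, medium, high, reported) in rows.items():
    estimate = overall_rate(low, medium, high)
    print(f"{agent}: estimated {estimate:.2f}, reported {reported:.2f}")
```

Under the assumed split sizes, the estimates agree with the reported "All (%)" values to within rounding. Note that the ± terms cannot be combined this way; they would need the per-seed results.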
Additional Leaderboard Submissions
Additional submissions that are not directly comparable to the main leaderboard (see Notes column).
| Agent | LLM(s) used | Low == Lite (%) | Medium (%) | High (%) | All (%) | Running Time (hours) | Date | Notes | Source Code Available | Grading Reports Available |
|-------|-------------|-----------------|------------|----------|---------|----------------------|------|-------|----------------------|---------------------------|
| Disarray | Ensemble (Claude-Opus-4.5, Claude-Sonnet-4.5, GPT-5.2-Codex, Gemini-3-Pro-Preview) | 90.91 ± 0.00 | 72.81 ± 0.88 | 71.11 ± 2.22 | 77.78 ± 0.44 | 24 | 2026-02-03 | [Test-set...
Excerpt shown — open the source for the full document.
Notability: 6.0/10 (New benchmark repo from OpenAI)