RepoOpenBMB (MiniCPM)OpenBMB (MiniCPM)published Jun 10, 2026seen 1w

OpenBMB/MA-ProofBench

Lean

Open original ↗

Captured source

source ↗
published Jun 10, 2026seen 1wcaptured 1whttp 200method plain

OpenBMB/MA-ProofBench

Language: Lean

License: MIT

Stars: 2

Forks: 0

Open issues: 0

Created: 2026-06-10T09:48:54Z

Pushed: 2026-06-15T02:31:21Z

Default branch: main

Fork: no

Archived: no

README:

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

English | 中文

We introduce MA-ProofBench, to the best of our knowledge, the first formal benchmark for evaluating large language models (LLMs) on theorem proving in Mathematical Analysis. It contains 200 rigorously formalized theorem-proving problems in Lean 4 (v4.28.0), split into two difficulty tiers:

| Tier | Description | Source | Count | |--------------|----------------|------------------------------------------|-------| | Level I | Undergraduate | Basic Textbook Exercises | 100 | | Level II | Ph.D. | Exam Problems from Top-Tier Universities | 100 |

The problems span 6 core topics and 27 subcategories, including *measure and integration theory*, *complex analysis*, and *functional analysis*, among other categories. MA-ProofBench targets areas that are underrepresented in prior benchmarks and require deep reasoning about continuity, limits, and topological structures. Each problem is built through a human-led, LLM-assisted formalization pipeline with independent expert blind review to ensure mathematical fidelity.

Category Distribution

Problems are classified following the Mathematics Subject Classification (MSC) scheme:

| Category | Level I | Level II | | ------------------------------- | ------- | -------- | | Real Functions | 44 | 12 | | Functional Analysis | 15 | 31 | | Functions of a Complex Variable | 19 | 16 | | Measure & Integration | 13 | 17 | | Operator Theory | 4 | 23 | | Sequences, Series, Summability | 5 | 1 |

Evaluation

The evaluation/ package generates proofs via an OpenAI-compatible API and verifies them with Kimina Lean Server.

First, install the evaluation dependencies:

pip install -r requirements.txt

Then set up and start the Lean server (make sure it uses Lean v4.28.0):

cd kimina-lean-server

cp .env.template .env # Optional
bash setup.sh # Installs Lean, repl and mathlib4
pip install -r requirements.txt
pip install .
prisma generate
python -m server

With the Lean server running, launch the evaluation script with an OpenAI-compatible API endpoint:

EXTERNAL_API_URL=https://your-api/v1/chat/completions \
EXTERNAL_API_KEY=sk-xxx \
EVAL_MODEL_PATH=your-model-name \
NUM_SAMPLES=8 \
EVAL_MAX_TOKENS=32768 \
EVAL_TEMPERATURE=1.0 \
bash evaluation/run.sh

To run a stage manually:

# Inference only
EXTERNAL_API_URL=https://your-api/v1/chat/completions \
EXTERNAL_API_KEY=sk-xxx \
python -m evaluation.main --mode infer \
--model_path "$EVAL_MODEL_PATH" \
--output_path predictions.jsonl \
--split level1 level2 --num_samples 8 \
--eval_max_tokens 32768 --eval_temperature 1.0

# Verification only
python -m evaluation.main --mode eval \
--predictions_path predictions.jsonl \
--output_path results.jsonl \
--split level1 level2 --num_samples 8

Citation

@article{ma-proofbench,
title={MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis},
author={Lushi Pu and Weiming Zhang and Xinheng Xie and Zixuan Fu and Bingxiang He and Hongya Lyu and Xin Li and Jie Zhou and Yudong Wang},
year={2026},
eprint={2606.13782},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.13782},
}

License

This project is released under the [MIT License](LICENSE).

Notability

notability 2.0/10

Low traction, new benchmark repo