microsoft/benchpress
Description: BenchPress: calibrated LLM benchmark score completion
Language: Python
License: MIT
Stars: 0
Forks: 0
Open issues: 1
Created: 2026-05-17T21:29:19Z
Pushed: 2026-06-24T02:59:46Z
Default branch: main
Fork: no
Archived: no
README: You Don't Need to Run Every Eval
Yuchen Zeng, Dimitris Papailiopoulos
Microsoft Research, AI Frontiers
Project page · Code · Dataset · Paper
Abstract: A modern model release reports scores on 40+ benchmarks; behind the release, evaluations were run orders of magnitude more often across checkpoints, hyperparameter sweeps, and design choices. We ask whether scores accumulated across public releases can *anticipate* a model's performance on benchmarks it has not yet been run on, and decide which evaluations are most worth running next.
We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 observed cells, 23.3% filled) and find its geometry is approximately rank-2: across complete submatrices, two factors explain more than 90% of the variance. We exploit this structure with BenchPress: logit-space bias-decomposed rank-2 matrix completion, which completes hidden scores within a 4.6 score-point median absolute error. A reliability analysis identifies when these predictions can be trusted — errors fall when the target model has richer observed evidence and behaviorally similar peers — calibrating 90% prediction intervals.
Finally, we stress-test deployment: five probe benchmarks predict the rest of the profile to a median absolute error of 3.93 score points (4.55 on a low-cost allowlist) while preserving 92.1% of pairwise model rankings, and reach 5.0 on brand-new releases.
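The headline numbers above are stated as a median absolute error in score points and as the share of pairwise model rankings preserved. The following is a minimal sketch of how two such metrics can be computed for one benchmark. The function names are illustrative, and the paper's exact aggregation (for example, per benchmark versus pooled, or how ties are treated) is an assumption here, not something this README states.

```python
from itertools import combinations

import numpy as np


def median_absolute_error(y_true, y_pred):
    """Median of |true - predicted|, in the same units as the scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.median(np.abs(y_true - y_pred)))


def pairwise_ranking_agreement(y_true, y_pred):
    """Fraction of model pairs that true and predicted scores order the same way.

    Assumption: pairs tied in the true scores are skipped.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(y_true)), 2):
        d_true = y_true[i] - y_true[j]
        if d_true == 0:
            continue
        total += 1
        d_pred = y_pred[i] - y_pred[j]
        if d_true * d_pred > 0:
            agree += 1
    return agree / total if total else float("nan")


# Toy values for four models on one benchmark (not taken from the dataset).
true_scores = [10.0, 20.0, 30.0, 40.0]
pred_scores = [12.0, 18.0, 35.0, 33.0]
mae = median_absolute_error(true_scores, pred_scores)
agreement = pairwise_ranking_agreement(true_scores, pred_scores)
```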
News 🚀
- [TBD] BenchPress paper and code released.
Contents
- [Step 1: Set Up Environment](#step-1-set-up-environment)
- [Step 2: Download the Data](#step-2-download-the-data)
- [Step 3: Predict Scores](#step-3-predict-scores)
  - [Predict for an Existing Model](#predict-for-an-existing-model)
  - [Add Your Own Model](#add-your-own-model)
- [Step 4: Reproduce Paper Experiments](#step-4-reproduce-paper-experiments)
  - [Repository Structure](#repository-structure)
  - [Artifact Policy](#artifact-policy)
  - [Run a Single Experiment](#run-a-single-experiment)
- [Step 5: Cite Us](#step-5-cite-us)
Step 1: Set Up Environment
To set up the environment for using BenchPress, please follow the steps below.
1. Clone this repository and change into the benchpress directory

    git clone https://github.com/microsoft/benchpress
    cd benchpress
2. Install Packages
Linux / Mac
    conda create -n benchpress python=3.10
    conda activate benchpress
    pip install -e .  # editable install; makes `from benchpress.*` work everywhere
    python -m benchpress.download_data
Windows TBA
Step 2: Download the Data
BenchPress uses a citation-backed evaluation matrix:
- 189 frontier LLMs from 28 providers (OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba, Mistral, xAI, Moonshot AI, Zhipu AI, Microsoft, ByteDance, Amazon, MiniMax, NVIDIA, Cohere, Allen AI, IBM, Liquid AI, LG AI Research, Hugging Face, OpenBMB, TII, Sarvam AI, Shanghai AI Lab, Open Thoughts, Meituan, Mistral AI) — before filtering
- 316 benchmarks across 59 categories — before filtering
- 4,905 observed scores, each citation-backed
- Paper-canonical filter (keep models with $\geq 15$ observed scores and benchmarks with $\geq 8$ observed models), with duplicate/setting-variant exclusions: 84 models × 133 benchmarks, 2,604 observed (23.3% fill rate)
- Smart clip: only percentage-scale benchmarks are clipped to [0, 100]; Elo/rating benchmarks (Codeforces, Chatbot Arena, GDP-Val) are left unclipped
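The filter and clipping rules above can be sketched on a score matrix that uses NaN for missing cells. This is an illustration under stated assumptions, not the repository's code. It assumes the two thresholds are re-applied until no row or column falls below them (the README does not say whether the filter is single-pass), and it does not model the duplicate/setting-variant exclusions. `canonical_filter`, `smart_clip`, `fill_rate`, and `is_percentage` are hypothetical names.

```python
import numpy as np


def canonical_filter(M, min_per_model=15, min_per_bench=8):
    """Return (row_idx, col_idx) of the models and benchmarks that survive.

    M is a 2-D float array (rows = models, columns = benchmarks, NaN = missing).
    Assumption: thresholds are applied repeatedly until the result is stable.
    """
    rows = np.arange(M.shape[0])
    cols = np.arange(M.shape[1])
    while True:
        obs = ~np.isnan(M[np.ix_(rows, cols)])
        keep_r = obs.sum(axis=1) >= min_per_model
        keep_c = obs.sum(axis=0) >= min_per_bench
        if keep_r.all() and keep_c.all():
            return rows, cols
        rows, cols = rows[keep_r], cols[keep_c]


def smart_clip(M, is_percentage):
    """Clip only percentage-scale columns to [0, 100]; NaN cells stay NaN."""
    out = M.copy()
    out[:, is_percentage] = np.clip(out[:, is_percentage], 0.0, 100.0)
    return out


def fill_rate(M):
    """Fraction of cells that are observed."""
    return float(np.mean(~np.isnan(M)))


# Toy matrix with tiny thresholds; column 0 is rating-like, columns 1-2 are percentages.
toy = np.array([
    [1500.0, 42.0, np.nan],
    [1320.0, np.nan, np.nan],
    [1410.0, 66.0, 120.0],
    [np.nan, 71.0, -5.0],
])
rows, cols = canonical_filter(toy, min_per_model=2, min_per_bench=2)
kept = toy[np.ix_(rows, cols)]
clipped = smart_clip(kept, is_percentage=np.array([False, True, True]))

# The reported fill rate follows from the counts in the README.
paper_fill = 2604 / (84 * 133)  # about 0.233
```

In the toy example the second model has only one observed score, so it is dropped, and the out-of-range percentage cells are clipped while the rating column is left alone.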
Observed (blue) vs. missing (white) cells in the paper-canonical 84 × 133 score matrix.
The paper-canonical dataset is published at:
- Hugging Face:
- Local cache after download: `benchpress/data/llm_benchmark_data.json`
BenchPress is a living dataset: new model releases, benchmark updates, and corrected citations can be added as the evaluation landscape changes. We welcome pull requests that add citation-backed scores, new models, new benchmarks, or provenance fixes.
After running `python -m benchpress.download_data`, local matrix files live in `benchpress/data/`:

    benchpress/data/
    ├── llm_benchmark_data.json       # Machine-readable scores
    ├── benchmark_cost_evidence.json  # Raw cost-evidence extracts, when available
    └── *.md                          # Data schema and provenance notes, when available
The downloader is artifact-first. If the full JSON export is available on Hugging Face, it downloads that exact artifact. If only the public table mirror is available, it reconstructs the paper score matrix from the mirror and prints the limitation explicitly; that fallback preserves the paper matrix but may not include every rich provenance field such as cost evidence and candidate-level score traces.
llm_benchmark_data.json is the canonical source for code. It contains:
- `models[]`: model metadata, including provider and release/canonical-setting fields when available
- `benchmarks[]`: benchmark metadata, including category, scale, canonical-setting fields, and cost evidence when available
- `scores[]`: observed cells as `{model_id, benchmark_id, score, reference_url}` records
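Records of the `scores[]` shape can be turned into a dense matrix with NaN for unobserved cells. The sketch below is one way to do that, not a description of how the package builds its own arrays. The helper name `scores_to_matrix`, the alphabetical row and column ordering, and the toy records are all assumptions made for illustration.

```python
import numpy as np


def scores_to_matrix(scores):
    """Build (M, model_idx, bench_idx) from {model_id, benchmark_id, score, ...} records.

    Assumptions: ids are sorted alphabetically to fix the ordering, and if a
    cell appears more than once the last record wins.
    """
    model_ids = sorted({r["model_id"] for r in scores})
    bench_ids = sorted({r["benchmark_id"] for r in scores})
    model_idx = {m: i for i, m in enumerate(model_ids)}
    bench_idx = {b: j for j, b in enumerate(bench_ids)}
    M = np.full((len(model_ids), len(bench_ids)), np.nan)
    for r in scores:
        M[model_idx[r["model_id"]], bench_idx[r["benchmark_id"]]] = r["score"]
    return M, model_idx, bench_idx


# Toy records with made-up ids and values, only to show the shape of the data.
records = [
    {"model_id": "model-a", "benchmark_id": "bench-x", "score": 71.5,
     "reference_url": "https://example.com/a-x"},
    {"model_id": "model-a", "benchmark_id": "bench-y", "score": 40.0,
     "reference_url": "https://example.com/a-y"},
    {"model_id": "model-b", "benchmark_id": "bench-x", "score": 65.0,
     "reference_url": "https://example.com/b-x"},
]
M, model_idx, bench_idx = scores_to_matrix(records)
```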
Use it directly from the package:
    from benchpress.evaluation_harness import M_FULL, MODEL_IDX, BENCH_IDX

    score = M_FULL[MODEL_IDX["gpt-5.2"], BENCH_IDX["gpqa_diamond"]]
or load the JSON yourself:
    import json
    from pathlib import Path

    data = json.loads(Path("benchpress/data/llm_benchmark_data.json").read_text())
    scores = data["scores"]

Or load the public Hugging Face mirror:
    from datasets import load_dataset

    ds = load_dataset("microsoft/benchpress-score-matrix", "scores_paper")

Step 3: Predict Scores
Predict for an Existing Model
    # Predict all missing scores for a model
    python predict.py --model gpt-5.2

    # Predict a single score
    python predict.py --model gpt-5.2 --benchmark gpqa_diamond

    # List available models / benchmarks
    python predict.py --list-models
    python predict.py --list-benchmarks
Add Your Own Model
Provide a few known scores; BenchPress predicts the rest.
    python predict.py --add-model my-model \
        --scores "simpleqa=50.0,gpqa_diamond=70.0,aime_2025=55.0"
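The `--scores` value is a comma-separated list of `benchmark=score` pairs. If you assemble that string programmatically, a small helper can check it first. The sketch below is a hypothetical validator, not the parser used by `predict.py`.

```python
def parse_scores(spec):
    """Parse 'name=value,name=value' into a dict of floats.

    Hypothetical helper for illustration; raises ValueError on malformed items.
    """
    out = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"expected benchmark=score, got {item!r}")
        out[name.strip()] = float(value)
    return out


known = parse_scores("simpleqa=50.0,gpqa_diamond=70.0,aime_2025=55.0")
```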
What happens under the hood
BenchPress uses Logit + Bias ALS:
1....
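The numbered steps are cut off in this excerpt, so the following does not reproduce them. It is a generic sketch of what the abstract's "logit-space bias-decomposed rank-2 matrix completion" could look like when fitted with alternating least squares. Everything specific here is an assumption: the function names, the ridge penalty `lam`, the clipping constant `eps`, the initialisation, the iteration count, and the treatment of every benchmark as a 0 to 100 percentage.

```python
import numpy as np


def logit(scores, eps=1e-3):
    """Map 0-100 scores to logit space, clipping away exact 0 and 100."""
    p = np.clip(np.asarray(scores, dtype=float) / 100.0, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


def sigmoid100(z):
    """Inverse of logit, returned on the 0-100 scale."""
    return 100.0 / (1.0 + np.exp(-z))


def fit_logit_bias_als(M, rank=2, lam=0.1, n_iters=50, seed=0):
    """Fit z_ij ~ mu + a_i + b_j + u_i . v_j on observed cells; return all predictions.

    M: models x benchmarks array of 0-100 scores with NaN for missing cells.
    """
    obs = ~np.isnan(M)
    # Missing cells get a placeholder (50 -> logit 0) and are masked out below.
    Z = np.where(obs, logit(np.where(obs, M, 50.0)), 0.0)
    n, m = M.shape
    rng = np.random.default_rng(seed)
    mu = Z[obs].mean()
    a = np.zeros(n)
    b = np.zeros(m)
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    ridge = lam * np.eye(rank)
    for _ in range(n_iters):
        # Bias updates: shrunken mean residual per model, then per benchmark.
        R = Z - mu - b[None, :] - U @ V.T
        a = (R * obs).sum(axis=1) / (obs.sum(axis=1) + lam)
        R = Z - mu - a[:, None] - U @ V.T
        b = (R * obs).sum(axis=0) / (obs.sum(axis=0) + lam)
        # Factor updates: ridge regression on the bias-removed residual.
        R = Z - mu - a[:, None] - b[None, :]
        for i in range(n):
            Vi = V[obs[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + ridge, Vi.T @ R[i, obs[i]])
        for j in range(m):
            Uj = U[obs[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + ridge, Uj.T @ R[obs[:, j], j])
    Z_hat = mu + a[:, None] + b[None, :] + U @ V.T
    return sigmoid100(Z_hat)


# Usage (M is a models x benchmarks array with np.nan for unobserved cells):
# predicted = fit_logit_bias_als(M)
```

Working in logit space keeps predictions inside the 0 to 100 range after the inverse transform, and the separate bias terms absorb overall model strength and benchmark difficulty before the rank-2 factors are fitted.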