Repo · Microsoft · published May 17, 2026

microsoft/benchpress

Python


Description: BenchPress: calibrated LLM benchmark score completion

Language: Python

License: MIT

Stars: 0

Forks: 0

Open issues: 1

Created: 2026-05-17T21:29:19Z

Pushed: 2026-06-24T02:59:46Z

Default branch: main

Fork: no

Archived: no

README: You Don't Need to Run Every Eval

Yuchen Zeng, Dimitris Papailiopoulos

Microsoft Research, AI Frontiers

Project page · Code · Dataset · Paper

Abstract: A modern model release reports scores on 40+ benchmarks; behind the release, evaluations were run orders of magnitude more often across checkpoints, hyperparameter sweeps, and design choices. We ask whether scores accumulated across public releases can *anticipate* a model's performance on benchmarks it has not yet been run on, and help decide which evaluations are most worth running next.

We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 observed cells, 23.3% filled) and find its geometry is approximately rank-2: across complete submatrices, two factors explain more than 90% of the variance. We exploit this structure with BenchPress: logit-space bias-decomposed rank-2 matrix completion, which completes hidden scores within a 4.6 score-point median absolute error. A reliability analysis identifies when these predictions can be trusted — errors fall when the target model has richer observed evidence and behaviorally similar peers — calibrating 90% prediction intervals.
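To make "logit-space bias-decomposed rank-2 matrix completion" concrete, here is a minimal sketch of one way such a model could be fit with ridge-regularized alternating least squares. It is not the BenchPress implementation: the function name `complete_rank2`, the regularization strength `lam`, the clipping constant `eps`, and the iteration count are all assumptions chosen for illustration, and the sketch assumes every score is on a 0 to 100 percentage scale.

```python
import numpy as np

def logit(p, eps=1e-3):
    # Clip away from 0 and 1 so the transform stays finite (eps is an assumption).
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def complete_rank2(scores, rank=2, lam=0.1, iters=50, seed=0):
    """Illustrative completion: logit(score/100) ~ mu + a_i + b_j + u_i . v_j.

    scores: (models x benchmarks) array in [0, 100], NaN marks a missing cell.
    Returns a dense array of predicted scores in [0, 100].
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(scores)
    # Missing cells get a placeholder; they are never used because of the mask.
    Z = logit(np.where(mask, scores, 50.0) / 100.0)
    n, m = scores.shape
    mu = Z[mask].mean()
    a, b = np.zeros(n), np.zeros(m)
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    eye = lam * np.eye(rank)
    for _ in range(iters):
        # Ridge update of the per-model bias, then the per-benchmark bias.
        R = Z - mu - b[None, :] - U @ V.T
        a = (R * mask).sum(axis=1) / (mask.sum(axis=1) + lam)
        R = Z - mu - a[:, None] - U @ V.T
        b = (R * mask).sum(axis=0) / (mask.sum(axis=0) + lam)
        # Ridge update of the low-rank factors on the bias-removed residual.
        R = Z - mu - a[:, None] - b[None, :]
        for i in range(n):
            obs = mask[i]
            U[i] = np.linalg.solve(V[obs].T @ V[obs] + eye, V[obs].T @ R[i, obs])
        for j in range(m):
            obs = mask[:, j]
            V[j] = np.linalg.solve(U[obs].T @ U[obs] + eye, U[obs].T @ R[obs, j])
    return 100.0 * sigmoid(mu + a[:, None] + b[None, :] + U @ V.T)
```

Working in logit space keeps every prediction inside the 0 to 100 range after the inverse transform, and the separate bias terms absorb overall model strength and benchmark difficulty before the two factors model the remaining interaction.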

Finally, we stress-test deployment: five probe benchmarks predict the rest of the profile to a median absolute error of 3.93 score points (4.55 on a low-cost allowlist) while preserving 92.1% of pairwise model rankings, and reach 5.0 on brand-new releases.
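The two headline metrics above, median absolute error and the share of pairwise model rankings preserved, can be sketched as follows. The exact definitions used in the paper are not given in this README, so this is an assumption: agreement is computed per benchmark over all model pairs, and pairs tied in the true scores are skipped. Both function names are hypothetical.

```python
from itertools import combinations

import numpy as np

def median_abs_error(y_true, y_pred):
    """Median of |true - predicted| over the held-out cells, in score points."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.median(np.abs(diff)))

def pairwise_ranking_agreement(y_true, y_pred):
    """Fraction of model pairs whose order is the same in truth and prediction.

    Pairs tied in the true scores are skipped (an assumption of this sketch).
    """
    agree = total = 0
    for i, j in combinations(range(len(y_true)), 2):
        d_true = y_true[i] - y_true[j]
        if d_true == 0:
            continue
        total += 1
        if d_true * (y_pred[i] - y_pred[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")
```

For example, with true scores `[10, 20, 30, 40]` and predictions `[12, 18, 35, 33]`, five of the six pairs keep their order and the median absolute error is 3.5 score points.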

News 🚀

  • [TBD] BenchPress paper and code released.

Contents

  • [Step 1: Set Up Environment](#step-1-set-up-environment)
  • [Step 2: Download the Data](#step-2-download-the-data)
  • [Step 3: Predict Scores](#step-3-predict-scores)
      • [Predict for an Existing Model](#predict-for-an-existing-model)
      • [Add Your Own Model](#add-your-own-model)
  • [Step 4: Reproduce Paper Experiments](#step-4-reproduce-paper-experiments)
      • [Repository Structure](#repository-structure)
      • [Artifact Policy](#artifact-policy)
      • [Run a Single Experiment](#run-a-single-experiment)
  • [Step 5: Cite Us](#step-5-cite-us)

Step 1: Set Up Environment

To set up the environment for using BenchPress, please follow the steps below.

1. Clone this repository and enter the benchpress directory

git clone https://github.com/microsoft/benchpress
cd benchpress

2. Install Packages

Linux / Mac

conda create -n benchpress python=3.10
conda activate benchpress
pip install -e . # editable install — makes `from benchpress.*` work everywhere
python -m benchpress.download_data

Windows TBA

Step 2: Download the Data

BenchPress uses a citation-backed evaluation matrix:

  • 189 frontier LLMs from 28 providers (OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba, Mistral, xAI, Moonshot AI, Zhipu AI, Microsoft, ByteDance, Amazon, MiniMax, NVIDIA, Cohere, Allen AI, IBM, Liquid AI, LG AI Research, Hugging Face, OpenBMB, TII, Sarvam AI, Shanghai AI Lab, Open Thoughts, Meituan, Mistral AI) — before filtering
  • 316 benchmarks across 59 categories — before filtering
  • 4,905 observed scores, each citation-backed
  • Paper-canonical filter (keep models with $\geq 15$ observed scores and benchmarks with $\geq 8$ observed models), with duplicate/setting-variant exclusions: 84 models × 133 benchmarks, 2,604 observed (23.3% fill rate)
  • Smart clip: only percentage-scale benchmarks are clipped to [0, 100]; Elo/rating benchmarks (Codeforces, Chatbot Arena, GDP-Val) are left unclipped
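The filter and clipping rules in the list above can be sketched on a NaN-masked matrix as shown below. This is an illustration, not the repository's code: the function names are hypothetical, and repeating the filter until both thresholds hold at once is an assumption (the README does not say whether the filter is applied once or iteratively, and the duplicate/setting-variant exclusions are not modeled here).

```python
import numpy as np

def canonical_filter(M, min_per_model=15, min_per_bench=8):
    """Drop sparse rows and columns until every kept model has at least
    min_per_model observed scores and every kept benchmark has at least
    min_per_bench observed models. Returns (kept_row_idx, kept_col_idx)."""
    rows = np.arange(M.shape[0])
    cols = np.arange(M.shape[1])
    while True:
        observed = ~np.isnan(M[np.ix_(rows, cols)])
        keep_r = observed.sum(axis=1) >= min_per_model
        keep_c = observed.sum(axis=0) >= min_per_bench
        if keep_r.all() and keep_c.all():
            return rows, cols
        rows, cols = rows[keep_r], cols[keep_c]

def smart_clip(M, is_percent):
    """Clip only percentage-scale columns to [0, 100]; leave Elo/rating columns alone.

    is_percent: boolean array with one entry per benchmark column.
    """
    out = M.copy()
    out[:, is_percent] = np.clip(out[:, is_percent], 0.0, 100.0)
    return out

def fill_rate(M):
    """Share of cells that are observed (not NaN)."""
    return float(np.mean(~np.isnan(M)))
```

As a consistency check on the reported numbers, 2,604 observed cells out of 84 × 133 = 11,172 gives a fill rate of about 23.3%.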

Observed (blue) vs. missing (white) cells in the paper-canonical 84 × 133 score matrix.

The paper-canonical dataset is published at:

  • Hugging Face:
  • Local cache after download: benchpress/data/llm_benchmark_data.json

BenchPress is a living dataset: new model releases, benchmark updates, and corrected citations can be added as the evaluation landscape changes. We welcome pull requests that add citation-backed scores, new models, new benchmarks, or provenance fixes.

After running python -m benchpress.download_data, local matrix files live in benchpress/data/:

benchpress/data/
├── llm_benchmark_data.json # Machine-readable scores
├── benchmark_cost_evidence.json # Raw cost-evidence extracts, when available
└── *.md # Data schema and provenance notes, when available

The downloader is artifact-first. If the full JSON export is available on Hugging Face, it downloads that exact artifact. If only the public table mirror is available, it reconstructs the paper score matrix from the mirror and prints the limitation explicitly; that fallback preserves the paper matrix but may not include every rich provenance field such as cost evidence and candidate-level score traces.
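The artifact-first behavior described above follows a common "try the exact artifact, then fall back and say so" pattern. The sketch below shows that control flow only; it is not the code in `benchpress.download_data`. The two fetch callables, the `ArtifactUnavailable` exception, and the returned source label are hypothetical names introduced for illustration.

```python
class ArtifactUnavailable(Exception):
    """Hypothetical signal that the full JSON export could not be fetched."""

def load_score_data(fetch_full_export, fetch_table_mirror):
    """Prefer the full export; otherwise rebuild from the table mirror.

    Both arguments are caller-supplied zero-argument functions (assumptions
    of this sketch). Returns (data, source_label).
    """
    try:
        return fetch_full_export(), "full_export"
    except ArtifactUnavailable:
        # State the limitation explicitly rather than failing silently.
        print("Full export unavailable; reconstructing the score matrix from "
              "the public table mirror. Rich provenance fields may be missing.")
        return fetch_table_mirror(), "table_mirror"
```

Returning the source label alongside the data lets downstream code decide whether fields such as cost evidence can be expected.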

llm_benchmark_data.json is the canonical source for code. It contains:

  • models[]: model metadata, including provider and release/canonical-setting fields when available
  • benchmarks[]: benchmark metadata, including category, scale, canonical-setting fields, and cost evidence when available
  • scores[]: observed cells as {model_id, benchmark_id, score, reference_url} records

Use it directly from the package:

from benchpress.evaluation_harness import M_FULL, MODEL_IDX, BENCH_IDX

score = M_FULL[MODEL_IDX["gpt-5.2"], BENCH_IDX["gpqa_diamond"]]

Or load the JSON yourself:

import json
from pathlib import Path

data = json.loads(Path("benchpress/data/llm_benchmark_data.json").read_text())
scores = data["scores"]

Or load the public Hugging Face mirror:

from datasets import load_dataset

ds = load_dataset("microsoft/benchpress-score-matrix", "scores_paper")
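Whichever route is used, the `scores[]` records of the form `{model_id, benchmark_id, score, reference_url}` can be pivoted into a dense matrix with NaN for missing cells. The helper below is a minimal sketch: its name is hypothetical, and the alphabetical row and column order is an assumption that need not match the ordering behind `M_FULL`, `MODEL_IDX`, and `BENCH_IDX`.

```python
import numpy as np

def records_to_matrix(scores):
    """Pivot score records into (matrix, model_index, benchmark_index).

    scores: iterable of dicts with "model_id", "benchmark_id", and "score".
    Missing (model, benchmark) pairs are left as NaN.
    """
    scores = list(scores)
    model_ids = sorted({r["model_id"] for r in scores})
    bench_ids = sorted({r["benchmark_id"] for r in scores})
    model_idx = {mid: i for i, mid in enumerate(model_ids)}
    bench_idx = {bid: j for j, bid in enumerate(bench_ids)}
    M = np.full((len(model_ids), len(bench_ids)), np.nan)
    for r in scores:
        M[model_idx[r["model_id"]], bench_idx[r["benchmark_id"]]] = r["score"]
    return M, model_idx, bench_idx
```

With the JSON loaded as `data`, calling `records_to_matrix(data["scores"])` would yield a matrix that can be indexed as `M[model_idx[...], bench_idx[...]]`.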

Step 3: Predict Scores

Predict for an Existing Model

# Predict all missing scores for a model
python predict.py --model gpt-5.2

# Predict a single score
python predict.py --model gpt-5.2 --benchmark gpqa_diamond

# List available models / benchmarks
python predict.py --list-models
python predict.py --list-benchmarks

Add Your Own Model

Provide a few known scores; BenchPress predicts the rest.

python predict.py --add-model my-model \
--scores "simpleqa=50.0,gpqa_diamond=70.0,aime_2025=55.0"
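The `--scores` value is a comma-separated list of `benchmark_id=score` pairs. The helper below sketches how such a string could be turned into a dictionary before prediction; it is an illustration with a hypothetical name, not the parser used inside `predict.py`.

```python
def parse_scores(spec):
    """Parse "name=value,name=value" into {name: float(value)}.

    Raises ValueError on an entry without "=" or with an empty name.
    """
    out = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        name, sep, value = item.partition("=")
        name = name.strip()
        if not sep or not name:
            raise ValueError(f"expected benchmark_id=score, got {item!r}")
        out[name] = float(value)
    return out
```

Validating the string early gives a clear error for a malformed pair instead of a failure deeper in the prediction step.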

What happens under the hood

BenchPress uses Logit + Bias ALS:

1....
