openai/frontier-evals
Python
Description: OpenAI Frontier Evals
Language: Python
License: MIT
Stars: 1219
Forks: 162
Open issues: 12
Created: 2025-03-28T15:45:01Z
Pushed: 2026-04-21T20:53:31Z
Default branch: main
Fork: no
Archived: no
README:
Frontier Evals
Code for evals measuring frontier model capabilities.
- [PaperBench](./project/paperbench): End-to-end replication of state-of-the-art AI papers. Paper | Blog
- [SWE-Lancer](./project/swelancer): Real freelance software engineering tasks with end-to-end tests. Paper | Blog
- [EVMbench](./project/evmbench): Smart contract security tasks. Paper | Blog
Usage
Requirements
We manage environments with `uv`. Install `uv` once, then run `uv sync` (or `uv pip install -r ...`) inside the project of interest to create its virtual environment from the checked-in `uv.lock`.
Running Evals
Each eval directory documents how to reproduce runs, configure models, and interpret results. Start with the suite `README.md`, then consult any scripts under `scripts/` or `runtime_*/` directories for orchestration details. When in doubt:
1. `cd` into the eval directory.
2. `uv sync` to install dependencies.
3. Follow the local instructions in the `README.md`.
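The first two steps above could be scripted roughly as follows. This is a minimal sketch, assuming `uv` is on your `PATH` and the script is run from the repository root; `sync_eval` and its `dry_run` flag are hypothetical names for illustration and are not part of the repository.

```python
import subprocess
from pathlib import Path


def sync_eval(eval_dir: str, dry_run: bool = True) -> list[str]:
    """Return the uv command for an eval directory, and run it unless dry_run.

    `sync_eval` is a hypothetical helper for illustration, not a repository API.
    """
    # Step 2: install dependencies from the checked-in uv.lock.
    cmd = ["uv", "sync"]
    if not dry_run:
        # Step 1: cwd= stands in for `cd` into the eval directory.
        subprocess.run(cmd, cwd=Path(eval_dir), check=True)
    return cmd


if __name__ == "__main__":
    # The default dry run only prints the command; pass dry_run=False to execute it.
    print(" ".join(sync_eval("project/paperbench")))
```

Step 3 stays manual: each eval's `README.md` remains the authority on how to configure and launch runs.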
Contributing
Layout
.
├── pyproject.toml      # Shared tooling configuration (Ruff, Black, etc.)
└── project/
    ├── common/         # Shared libraries
    ├── evmbench/       # EVMBench eval
    ├── paperbench/     # PaperBench eval
    └── swelancer/      # SWE-Lancer eval
Each eval directory is its own isolated project with a `README.md`, `pyproject.toml` and `uv.lock`.
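One way to check a checkout against the layout described above is to scan `project/` for the three files each eval is expected to carry. This is a sketch only; `find_eval_projects` and `REQUIRED_FILES` are hypothetical names introduced here, not repository code.

```python
from pathlib import Path

# Files the README says every eval directory contains.
REQUIRED_FILES = ("README.md", "pyproject.toml", "uv.lock")


def find_eval_projects(project_root: Path) -> dict[str, list[str]]:
    """Map each subdirectory of project_root to the required files it is missing."""
    report = {}
    for child in sorted(project_root.iterdir()):
        if child.is_dir():
            report[child.name] = [
                name for name in REQUIRED_FILES if not (child / name).is_file()
            ]
    return report


if __name__ == "__main__":
    root = Path("project")
    if root.is_dir():
        for name, missing in find_eval_projects(root).items():
            status = "ok" if not missing else "missing " + ", ".join(missing)
            print(f"{name}: {status}")
```

The README only makes this promise for eval directories, so `common/` may legitimately show up with missing files.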
Development Workflow
- Create or activate the environment for the project you are working on with `uv`. Example for PaperBench: `cd project/paperbench`, then `uv sync`, then `uv run pytest`.
- Code style and linting use Ruff (with autofix profiles in `pyproject.toml` and `project/common/tooling/ruff_autofix_minimal.toml`) and Black. Run `uv run ruff check --fix` or use the provided Poe/`make` tasks where available.
- Shared utilities live under `project/common`; changes there may affect multiple evals. Bump the relevant editable dependencies if you create new shared subpackages.
Notability
Notability 6.0/10: new eval toolkit from OpenAI, moderate stars.