Repo: Groq, published Jul 31, 2025

groq/openbench

Python


Description: Provider-agnostic, open-source evaluation infrastructure for language models

Language: Python

License: MIT

Stars: 782

Forks: 101

Open issues: 3

Created: 2025-07-31T08:02:49Z

Pushed: 2026-04-28T16:46:13Z

Default branch: main

Fork: no

Archived: no

README:

openbench

Provider-agnostic, open-source evaluation infrastructure for language models

openbench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long-context recall, and graph reasoning, with first-class support for your own local evals to preserve privacy. It works with any model provider: Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, Hugging Face, and 30+ other providers.

To get started, see the tutorial below or reference the docs.

Features

  • 🎯 95+ Benchmarks: MMLU, GPQA, HumanEval, SimpleQA, competition math (AIME, HMMT), SciCode, GraphWalks, and more
  • 🔧 Simple CLI: bench list, bench describe, bench eval (also available as openbench), -M/-T flags for model/task args, --debug mode for eval-retry, experimental benchmarks with --alpha flag
  • 🏗️ Built on inspect-ai: Industry-standard evaluation framework
  • 📊 Extensible: Easy to add new benchmarks and metrics
  • 🤖 Provider-agnostic: Works with 30+ model providers out of the box
  • 🛠️ Local Eval Support: Privatized benchmarks can be run with bench eval
  • 📤 Hugging Face Integration: Push evaluation results directly to Hugging Face datasets

🏃 Speedrun: Evaluate a Model in 60 Seconds

Prerequisite: Install uv

# Create a virtual environment and install openbench (30 seconds)
uv venv
source .venv/bin/activate
uv pip install openbench

# Set your API key (any provider!)
export GROQ_API_KEY=your_key # or OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.

# Run your first eval (30 seconds)
bench eval mmlu --model groq/openai/gpt-oss-120b --limit 10

# That's it! 🎉 Check results in ./logs/ or view them in an interactive UI:
bench view
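
The same run can also be scripted from Python. What follows is a minimal sketch, assuming the `bench` executable is on your PATH after installation. `build_eval_command` and `run_eval` are hypothetical helper names, not part of openbench, and only the `--model` and `--limit` flags shown above are used.

```python
import shlex
import subprocess


def build_eval_command(benchmark, model, limit=None):
    """Assemble the argument list for `bench eval` (hypothetical helper)."""
    cmd = ["bench", "eval", benchmark, "--model", model]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


def run_eval(benchmark, model, limit=None):
    """Run the eval as a subprocess; raises CalledProcessError on failure."""
    return subprocess.run(build_eval_command(benchmark, model, limit), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch needs no API key.
    cmd = build_eval_command("mmlu", "groq/openai/gpt-oss-120b", limit=10)
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems; call `run_eval(...)` once the API key is exported.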

https://github.com/user-attachments/assets/e99e4628-f1f5-48e4-9df2-ae28b86168c2

Supported Providers

openbench supports 30+ model providers through Inspect AI. Set the appropriate API key environment variable and you're ready to go:

| Provider | Environment Variable | Example Model String |
| --- | --- | --- |
| AI21 Labs | AI21_API_KEY | ai21/model-name |
| Anthropic | ANTHROPIC_API_KEY | anthropic/model-name |
| AWS Bedrock | AWS credentials | bedrock/model-name |
| Azure | AZURE_OPENAI_API_KEY | azure/ |
| Baseten | BASETEN_API_KEY | baseten/model-name |
| Cerebras | CEREBRAS_API_KEY | cerebras/model-name |
| Cohere | COHERE_API_KEY | cohere/model-name |
| Crusoe | CRUSOE_API_KEY | crusoe/model-name |
| DeepInfra | DEEPINFRA_API_KEY | deepinfra/model-name |
| Friendli | FRIENDLI_TOKEN | friendli/model-name |
| Google | GOOGLE_API_KEY | google/model-name |
| Groq | GROQ_API_KEY | groq/model-name |
| Helicone | HELICONE_API_KEY | helicone/model-name |
| Hugging Face | HF_TOKEN | huggingface/model-name |
| Hyperbolic | HYPERBOLIC_API_KEY | hyperbolic/model-name |
| Lambda | LAMBDA_API_KEY | lambda/model-name |
| MiniMax | MINIMAX_API_KEY | minimax/model-name |
| Mistral | MISTRAL_API_KEY | mistral/model-name |
| Moonshot | MOONSHOT_API_KEY | moonshot/model-name |
| Nebius | NEBIUS_API_KEY | nebius/model-name |
| Nous Research | NOUS_API_KEY | nous/model-name |
| Novita AI | NOVITA_API_KEY | novita/model-name |
| Ollama | None (local) | ollama/model-name |
| OpenAI | OPENAI_API_KEY | openai/model-name |
| OpenRouter | OPENROUTER_API_KEY | openrouter/model-name |
| Parasail | PARASAIL_API_KEY | parasail/model-name |
| Perplexity | PERPLEXITY_API_KEY | perplexity/model-name |
| Reka | REKA_API_KEY | reka/model-name |
| SambaNova | SAMBANOVA_API_KEY | sambanova/model-name |
| SiliconFlow | SILICONFLOW_API_KEY | siliconflow/model-name |
| Together AI | TOGETHER_API_KEY | together/model-name |
| Vercel AI Gateway | AI_GATEWAY_API_KEY | vercel/creator-name/model-name |
| W&B Inference | WANDB_API_KEY | wandb/model-name |
| vLLM | None (local) | vllm/model-name |
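
Before launching a run, it can help to check that the key for the chosen provider is actually set. The sketch below is illustrative only: it copies a handful of rows from the table, assumes the provider is the first slash-separated segment of the model string (as in `groq/openai/gpt-oss-120b`), and uses hypothetical helper names (`split_model_string`, `missing_credential`) that are not part of openbench.

```python
import os

# Subset of the provider table: model-string prefix -> API key variable.
# None marks local providers that need no key.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
    "huggingface": "HF_TOKEN",
    "ollama": None,
    "openai": "OPENAI_API_KEY",
    "vllm": None,
}


def split_model_string(model):
    """Split 'provider/model-name' at the first slash only."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model!r}")
    return provider, name


def missing_credential(model, env=None):
    """Return the unset env var the model's provider needs, or None if ready."""
    env = os.environ if env is None else env
    provider, _ = split_model_string(model)
    if provider not in PROVIDER_ENV_VARS:
        raise KeyError(f"provider {provider!r} is not in this sketch's table")
    var = PROVIDER_ENV_VARS[provider]
    if var is None or env.get(var):
        return None
    return var
```

For example, `missing_credential("groq/openai/gpt-oss-120b")` returns `"GROQ_API_KEY"` when that variable is unset and `None` once it is exported.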

Available Benchmarks

See the Benchmarks Catalog or use bench list.

Commands and Options

For a complete list of all commands and options, run bench --help. See the docs for more details.

| Command | Description |
| --- | --- |
| bench list | List available benchmarks |
| bench eval | Run benchmark evaluation |
| bench eval-retry | Retry a failed evaluation |
| bench view | Interactive UI to view benchmark logs |
| bench cache | Manage OpenBench caches |

Common eval Configuration Options

| Option | Environment Variable | Default | Description |
| --- | --- | --- | --- |
| -M | None | None | Pass provider/model-specific arguments (e.g., -M only=groq) |
| -T | None | None | Pass task-specific arguments to the benchmark |
| --model | BENCH_MODEL | groq/openai/gpt-oss-20b | Model(s) to evaluate |
| --epochs | BENCH_EPOCHS | 1 | Number of epochs to run each evaluation |
| --epochs-reducer | BENCH_EPOCHS_REDUCER | None | Reducer(s) applied when aggregating epoch scores |
| --max-connections | BENCH_MAX_CONNECTIONS | 10 | Maximum parallel requests to model |
| --temperature | BENCH_TEMPERATURE | 0.6 | Model temperature |
| --top-p | BENCH_TOP_P | 1.0 | Model top-p |
| --max-tokens | BENCH_MAX_TOKENS | None | Maximum tokens for model response |
| --seed | BENCH_SEED | None | Seed for deterministic generation |
| --limit | BENCH_LIMIT | None | Limit evaluated samples (number or start,end) |
| --logfile | BENCH_OUTPUT | None | Output file… |
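
To make the table concrete, here is a rough sketch of how such options could be resolved. It is not openbench's implementation: the precedence (explicit argument, then environment variable, then default) is an assumption, and `OPTION_SPECS`, `parse_limit`, and `resolve_options` are hypothetical names. Only the variables and defaults listed in the table are used.

```python
import os

# (environment variable, default, converter) taken from the options table.
OPTION_SPECS = {
    "model": ("BENCH_MODEL", "groq/openai/gpt-oss-20b", str),
    "epochs": ("BENCH_EPOCHS", 1, int),
    "max_connections": ("BENCH_MAX_CONNECTIONS", 10, int),
    "temperature": ("BENCH_TEMPERATURE", 0.6, float),
    "top_p": ("BENCH_TOP_P", 1.0, float),
}


def parse_limit(value):
    """Parse a --limit value: '10' becomes 10, '5,20' becomes (5, 20)."""
    if "," in value:
        start, end = value.split(",", 1)
        return int(start), int(end)
    return int(value)


def resolve_options(cli_args=None, env=None):
    """Assumed precedence: explicit argument, then env var, then default."""
    cli_args = cli_args or {}
    env = os.environ if env is None else env
    resolved = {}
    for name, (var, default, convert) in OPTION_SPECS.items():
        if cli_args.get(name) is not None:
            resolved[name] = cli_args[name]
        elif var in env:
            resolved[name] = convert(env[var])
        else:
            resolved[name] = default
    return resolved
```

With `BENCH_TEMPERATURE=0.2` exported and `epochs=3` passed explicitly, this sketch would yield a temperature of 0.2, three epochs, and the table defaults for everything else.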

