deepinfra/openbench
forked from groq/openbench
Description: Provider-agnostic, open-source evaluation infrastructure for language models
License: MIT
Stars: 0
Forks: 0
Open issues: 0
Created: 2025-11-14T00:18:56Z
Pushed: 2025-11-13T22:30:56Z
Default branch: main
Fork: yes
Parent repository: groq/openbench
Archived: no
README:
openbench
Provider-agnostic, open-source evaluation infrastructure for language models 🚀

openbench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long-context recall, graph reasoning, and first-class support for your own local evals to preserve privacy. Works with any model provider - Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, Hugging Face, and 30+ other providers.
🚧 Alpha Release
We're building in public! This is an alpha release - expect rapid iteration. The first stable release is coming soon.
Features
- 🎯 35+ Benchmarks: MMLU, GPQA, HumanEval, SimpleQA, competition math (AIME, HMMT), SciCode, GraphWalks, and more
- 🔧 Simple CLI: `bench list`, `bench describe`, `bench eval` (also available as `openbench`), `-M`/`-T` flags for model/task args, `--debug` mode for eval-retry, experimental benchmarks with `--alpha` flag
- 🏗️ Built on inspect-ai: Industry-standard evaluation framework
- 📊 Extensible: Easy to add new benchmarks and metrics
- 🤖 Provider-agnostic: Works with 30+ model providers out of the box
- 🛠️ Local Eval Support: Privatized benchmarks can be run with `bench eval`
- 📤 Hugging Face Integration: Push evaluation results directly to Hugging Face datasets
🏃 Speedrun: Evaluate a Model in 60 Seconds
Prerequisite: Install uv
```bash
# Create a virtual environment and install openbench (30 seconds)
uv venv
source .venv/bin/activate
uv pip install openbench

# Set your API key (any provider!)
export GROQ_API_KEY=your_key  # or OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.

# Run your first eval (30 seconds)
bench eval mmlu --model groq/llama-3.3-70b-versatile --limit 10

# That's it! 🎉 Check results in ./logs/ or view them in an interactive UI:
bench view
```
https://github.com/user-attachments/assets/e99e4628-f1f5-48e4-9df2-ae28b86168c2
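If you want to script the speedrun rather than type it, the sketch below assembles the same `bench eval` invocation as an argument list. It uses only the subcommand and flags shown above (`--model`, `--limit`); the helper name `build_eval_command` is hypothetical and not part of openbench.

```python
import shlex
from typing import List, Optional


def build_eval_command(task: str, model: str, limit: Optional[int] = None) -> List[str]:
    """Build the argv list for a `bench eval` run (hypothetical helper)."""
    cmd = ["bench", "eval", task, "--model", model]
    if limit is not None:
        # --limit caps the number of samples, as in the speedrun example.
        cmd += ["--limit", str(limit)]
    return cmd


if __name__ == "__main__":
    cmd = build_eval_command("mmlu", "groq/llama-3.3-70b-versatile", limit=10)
    # Print a shell-safe rendering of the command instead of running it.
    print(shlex.join(cmd))
    # To execute it, pass the list to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell quoting problems when model strings contain characters such as `:`.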
Optional Plugins
Some benchmark suites ship as standalone plugins so they can iterate independently from the core distribution. Install them alongside openbench with `uv pip` and they will automatically appear in `bench list` via the plugin entry point system.
- openbench-cyber: adds the CTI-Bench family plus CyBench (agentic CTF challenges). This plugin ships real exploit code and forensics artifacts that routinely trigger anti-malware scanners, so we require a deliberate, manual install after you read the security guidance.
  - Install explicitly: `uv pip install "openbench-cyber @ git+https://github.com/groq/openbench-cyber.git@d93522ba70392cdceddb83f762c78a68923e70da"`
  - Review the plugin README for sandbox requirements and risk acknowledgements before using it.
Using Different Providers
```bash
# Groq (blazing fast!)
bench eval gpqa_diamond --model groq/meta-llama/llama-4-maverick-17b-128e-instruct

# OpenAI
bench eval humaneval --model openai/o3-2025-04-16

# Anthropic
bench eval simpleqa --model anthropic/claude-sonnet-4-20250514

# Google
bench eval mmlu --model google/gemini-2.5-pro

# Local models with Ollama
bench eval musr --model ollama/llama3.1:70b

# Helicone AI Gateway
bench eval mmlu --model helicone/gpt-4o

# Hugging Face Inference Providers
bench eval mmlu --model huggingface/gpt-oss-120b:groq

# OpenRouter
bench eval gpqa_diamond --model openrouter/deepseek/deepseek-chat-v3.1

# 30+ providers supported - see full list below
```
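The model strings above all follow a `provider/model-name` shape, where the model name may itself contain slashes or colons. The sketch below shows one way to separate the two parts by splitting on the first `/` only. This reading of the format is an assumption based on the examples, and `split_model_string` is a hypothetical helper, not an openbench or Inspect AI function.

```python
from typing import Tuple


def split_model_string(model: str) -> Tuple[str, str]:
    """Split 'provider/model-name' on the first slash (assumed convention)."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model!r}")
    return provider, name


if __name__ == "__main__":
    for example in (
        "groq/meta-llama/llama-4-maverick-17b-128e-instruct",
        "ollama/llama3.1:70b",
        "openrouter/deepseek/deepseek-chat-v3.1",
    ):
        print(split_model_string(example))
```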
Supported Providers
openbench supports 30+ model providers through Inspect AI. Set the appropriate API key environment variable and you're ready to go:
| Provider          | Environment Variable | Example Model String           |
| ----------------- | -------------------- | ------------------------------ |
| AI21 Labs         | AI21_API_KEY         | ai21/model-name                |
| Anthropic         | ANTHROPIC_API_KEY    | anthropic/model-name           |
| AWS Bedrock       | AWS credentials      | bedrock/model-name             |
| Azure             | AZURE_OPENAI_API_KEY | azure/                         |
| Baseten           | BASETEN_API_KEY      | baseten/model-name             |
| Cerebras          | CEREBRAS_API_KEY     | cerebras/model-name            |
| Cohere            | COHERE_API_KEY       | cohere/model-name              |
| Crusoe            | CRUSOE_API_KEY       | crusoe/model-name              |
| DeepInfra         | DEEPINFRA_API_KEY    | deepinfra/model-name           |
| Friendli          | FRIENDLI_TOKEN       | friendli/model-name            |
| Google            | GOOGLE_API_KEY       | google/model-name              |
| Groq              | GROQ_API_KEY         | groq/model-name                |
| Helicone          | HELICONE_API_KEY     | helicone/model-name            |
| Hugging Face      | HF_TOKEN             | huggingface/model-name         |
| Hyperbolic        | HYPERBOLIC_API_KEY   | hyperbolic/model-name          |
| Lambda            | LAMBDA_API_KEY       | lambda/model-name              |
| MiniMax           | MINIMAX_API_KEY      | minimax/model-name             |
| Mistral           | MISTRAL_API_KEY      | mistral/model-name             |
| Moonshot          | MOONSHOT_API_KEY     | moonshot/model-name            |
| Nebius            | NEBIUS_API_KEY       | nebius/model-name              |
| Nous Research     | NOUS_API_KEY         | nous/model-name                |
| Novita AI         | NOVITA_API_KEY       | novita/model-name              |
| Ollama            | None (local)         | ollama/model-name              |
| OpenAI            | OPENAI_API_KEY       | openai/model-name              |
| OpenRouter        | OPENROUTER_API_KEY   | openrouter/model-name          |
| Parasail          | PARASAIL_API_KEY     | parasail/model-name            |
| Perplexity        | PERPLEXITY_API_KEY   | perplexity/model-name          |
| Reka              | REKA_API_KEY         | reka/model-name                |
| SambaNova         | SAMBANOVA_API_KEY    | sambanova/model-name           |
| SiliconFlow       | SILICONFLOW_API_KEY  | siliconflow/model-name         |
| Together AI       | TOGETHER_API_KEY     | together/model-name            |
| Vercel AI Gateway | AI_GATEWAY_API_KEY   | vercel/creator-name/model-name |
| W&B Inference     | WANDB_API_KEY        | wandb/model-name               |
| vLLM              | None (local)         | vllm/model-name                |
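The table lends itself to a small pre-flight check before launching an eval. The sketch below copies a subset of its rows into a dictionary and reports which environment variable is missing for a given model string. The helper `missing_credential` is hypothetical and not part of openbench. It also assumes the provider is the text before the first `/`.

```python
import os
from typing import Mapping, Optional

# Subset of the provider table above; None means no API key is needed (local).
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "deepinfra": "DEEPINFRA_API_KEY",
    "friendli": "FRIENDLI_TOKEN",
    "groq": "GROQ_API_KEY",
    "huggingface": "HF_TOKEN",
    "openai": "OPENAI_API_KEY",
    "vercel": "AI_GATEWAY_API_KEY",
    "ollama": None,
    "vllm": None,
}


def missing_credential(model: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the unset env var name for `model`, or None if nothing is missing."""
    env = os.environ if env is None else env
    provider = model.split("/", 1)[0]
    if provider not in PROVIDER_ENV_VARS:
        raise KeyError(f"provider {provider!r} is not in this sketch's subset")
    var = PROVIDER_ENV_VARS[provider]
    if var is None or env.get(var):
        return None
    return var


if __name__ == "__main__":
    needed = missing_credential("groq/llama-3.3-70b-versatile")
    print("ready" if needed is None else f"set {needed} first")
```

Passing `env` explicitly makes the check easy to test without touching the real process environment.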
Available Benchmarks
Here are the currently available benchmarks. For an up-to-date list, use `bench list`.
> [!NOTE] >…
Excerpt shown — open the source for the full document.
Notability
3.0/10. Routine fork, no notable traction.