Fork · DeepInfra · published Nov 14, 2025

deepinfra/openbench

forked from groq/openbench

Captured source

Description: Provider-agnostic, open-source evaluation infrastructure for language models

License: MIT

Stars: 0

Forks: 0

Open issues: 0

Created: 2025-11-14T00:18:56Z

Pushed: 2025-11-13T22:30:56Z

Default branch: main

Fork: yes

Parent repository: groq/openbench

Archived: no

README:

openbench

Provider-agnostic, open-source evaluation infrastructure for language models 🚀

![PyPI version](https://badge.fury.io/py/openbench)

openbench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long-context recall, and graph reasoning, with first-class support for your own local evals to preserve privacy. It works with any model provider: Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, Hugging Face, and 30+ other providers.

🚧 Alpha Release

We're building in public! This is an alpha release - expect rapid iteration. The first stable release is coming soon.

Features

  • 🎯 35+ Benchmarks: MMLU, GPQA, HumanEval, SimpleQA, competition math (AIME, HMMT), SciCode, GraphWalks, and more
  • 🔧 Simple CLI: bench list, bench describe, bench eval (also available as openbench), -M/-T flags for model/task args, --debug mode for eval-retry, experimental benchmarks with --alpha flag
  • 🏗️ Built on inspect-ai: Industry-standard evaluation framework
  • 📊 Extensible: Easy to add new benchmarks and metrics
  • 🤖 Provider-agnostic: Works with 30+ model providers out of the box
  • 🛠️ Local Eval Support: Run your own private benchmarks with bench eval
  • 📤 Hugging Face Integration: Push evaluation results directly to Hugging Face datasets

🏃 Speedrun: Evaluate a Model in 60 Seconds

Prerequisite: Install uv

# Create a virtual environment and install openbench (30 seconds)
uv venv
source .venv/bin/activate
uv pip install openbench

# Set your API key (any provider!)
export GROQ_API_KEY=your_key # or OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.

# Run your first eval (30 seconds)
bench eval mmlu --model groq/llama-3.3-70b-versatile --limit 10

# That's it! 🎉 Check results in ./logs/ or view them in an interactive UI:
bench view

https://github.com/user-attachments/assets/e99e4628-f1f5-48e4-9df2-ae28b86168c2
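
To script the speedrun command rather than type it by hand, a small wrapper can assemble the argument list. This is a minimal sketch: `build_eval_command` is a hypothetical helper, not part of openbench, and it uses only the `bench eval`, `--model`, and `--limit` arguments shown above.

```python
import shlex


def build_eval_command(benchmark, model, limit=None):
    """Return the argv list for a `bench eval` run (hypothetical helper)."""
    argv = ["bench", "eval", benchmark, "--model", model]
    if limit is not None:
        # --limit caps the number of samples, as in the speedrun example.
        argv += ["--limit", str(limit)]
    return argv


command = build_eval_command("mmlu", "groq/llama-3.3-70b-versatile", limit=10)

# Print the shell-quoted form for inspection before running anything.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the eval, assuming openbench is installed in the active environment and the provider API key is set.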

Optional Plugins

Some benchmark suites ship as standalone plugins so they can iterate independently from the core distribution. Install them alongside openbench with uv pip and they will automatically appear in bench list via the plugin entry point system.

  • openbench-cyber: adds the CTI-Bench family plus CyBench (agentic CTF challenges). This plugin ships real exploit code and forensics artifacts that routinely trigger anti-malware scanners, so we require a deliberate, manual install after you read the security guidance.
  • Install explicitly: uv pip install "openbench-cyber @ git+https://github.com/groq/openbench-cyber.git@d93522ba70392cdceddb83f762c78a68923e70da"
  • Review the plugin README for sandbox requirements and risk acknowledgements before using it.

Using Different Providers

# Groq (blazing fast!)
bench eval gpqa_diamond --model groq/meta-llama/llama-4-maverick-17b-128e-instruct

# OpenAI
bench eval humaneval --model openai/o3-2025-04-16

# Anthropic
bench eval simpleqa --model anthropic/claude-sonnet-4-20250514

# Google
bench eval mmlu --model google/gemini-2.5-pro

# Local models with Ollama
bench eval musr --model ollama/llama3.1:70b

# Helicone AI Gateway
bench eval mmlu --model helicone/gpt-4o

# Hugging Face Inference Providers
bench eval mmlu --model huggingface/gpt-oss-120b:groq

# OpenRouter
bench eval gpqa_diamond --model openrouter/deepseek/deepseek-chat-v3.1

# 30+ providers supported - see full list below
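
In the examples above, the text before the first slash appears to name the provider, and everything after it is the provider's own model identifier, which may itself contain slashes or colons. The sketch below splits a model string on that assumption. `split_model_string` is a hypothetical helper inferred from the examples, not openbench's actual parsing logic.

```python
def split_model_string(model):
    """Split 'provider/model-id' at the first slash (hypothetical helper)."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model!r}")
    return provider, name


for example in (
    "groq/meta-llama/llama-4-maverick-17b-128e-instruct",
    "ollama/llama3.1:70b",
    "huggingface/gpt-oss-120b:groq",
):
    provider, name = split_model_string(example)
    print(f"{provider:12} {name}")
```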

Supported Providers

openbench supports 30+ model providers through Inspect AI. Set the appropriate API key environment variable and you're ready to go:

| Provider          | Environment Variable | Example Model String           |
| ----------------- | -------------------- | ------------------------------ |
| AI21 Labs         | AI21_API_KEY         | ai21/model-name                |
| Anthropic         | ANTHROPIC_API_KEY    | anthropic/model-name           |
| AWS Bedrock       | AWS credentials      | bedrock/model-name             |
| Azure             | AZURE_OPENAI_API_KEY | azure/                         |
| Baseten           | BASETEN_API_KEY      | baseten/model-name             |
| Cerebras          | CEREBRAS_API_KEY     | cerebras/model-name            |
| Cohere            | COHERE_API_KEY       | cohere/model-name              |
| Crusoe            | CRUSOE_API_KEY       | crusoe/model-name              |
| DeepInfra         | DEEPINFRA_API_KEY    | deepinfra/model-name           |
| Friendli          | FRIENDLI_TOKEN       | friendli/model-name            |
| Google            | GOOGLE_API_KEY       | google/model-name              |
| Groq              | GROQ_API_KEY         | groq/model-name                |
| Helicone          | HELICONE_API_KEY     | helicone/model-name            |
| Hugging Face      | HF_TOKEN             | huggingface/model-name         |
| Hyperbolic        | HYPERBOLIC_API_KEY   | hyperbolic/model-name          |
| Lambda            | LAMBDA_API_KEY       | lambda/model-name              |
| MiniMax           | MINIMAX_API_KEY      | minimax/model-name             |
| Mistral           | MISTRAL_API_KEY      | mistral/model-name             |
| Moonshot          | MOONSHOT_API_KEY     | moonshot/model-name            |
| Nebius            | NEBIUS_API_KEY       | nebius/model-name              |
| Nous Research     | NOUS_API_KEY         | nous/model-name                |
| Novita AI         | NOVITA_API_KEY       | novita/model-name              |
| Ollama            | None (local)         | ollama/model-name              |
| OpenAI            | OPENAI_API_KEY       | openai/model-name              |
| OpenRouter        | OPENROUTER_API_KEY   | openrouter/model-name          |
| Parasail          | PARASAIL_API_KEY     | parasail/model-name            |
| Perplexity        | PERPLEXITY_API_KEY   | perplexity/model-name          |
| Reka              | REKA_API_KEY         | reka/model-name                |
| SambaNova         | SAMBANOVA_API_KEY    | sambanova/model-name           |
| SiliconFlow       | SILICONFLOW_API_KEY  | siliconflow/model-name         |
| Together AI       | TOGETHER_API_KEY     | together/model-name            |
| Vercel AI Gateway | AI_GATEWAY_API_KEY   | vercel/creator-name/model-name |
| W&B Inference     | WANDB_API_KEY        | wandb/model-name               |
| vLLM              | None (local)         | vllm/model-name                |
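
A pre-flight check can catch a missing API key before an eval starts. The sketch below copies a handful of rows from the table into a dictionary and reports which environment variable, if any, is unset for a given model string. `PROVIDER_ENV_VARS` and `missing_credential` are hypothetical names for illustration, the mapping is deliberately a subset, and providers such as AWS Bedrock that use a different credential mechanism are left out.

```python
import os

# Illustrative subset of the provider table; None marks providers
# listed as needing no key (local models).
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "deepinfra": "DEEPINFRA_API_KEY",
    "groq": "GROQ_API_KEY",
    "huggingface": "HF_TOKEN",
    "openai": "OPENAI_API_KEY",
    "ollama": None,
    "vllm": None,
}


def missing_credential(model, environ=None):
    """Return the unset env var name for this model's provider, or None."""
    environ = os.environ if environ is None else environ
    provider = model.split("/", 1)[0]
    if provider not in PROVIDER_ENV_VARS:
        raise KeyError(f"provider {provider!r} is not in this illustrative subset")
    env_var = PROVIDER_ENV_VARS[provider]
    if env_var is None or environ.get(env_var):
        return None
    return env_var


needed = missing_credential("groq/llama-3.3-70b-versatile")
if needed:
    print(f"Set {needed} before running bench eval")
else:
    print("Credentials look present")
```

The optional `environ` argument lets the check run against a plain dictionary, which keeps it testable without touching the real environment.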

Available Benchmarks

Here are the currently available benchmarks. For an up-to-date list use bench list.

> [!NOTE] >…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine fork, no notable traction