NVIDIA/QCalEval
Python
Captured source
source ↗NVIDIA/QCalEval
Description: Evaluation scripts for the QCalEval benchmark — a dataset for assessing vision-language model capabilities on quantum calibration experiment analysis.
Language: Python
License: Apache-2.0
Stars: 19
Forks: 2
Open issues: 0
Created: 2026-04-13T02:03:22Z
Pushed: 2026-04-23T23:08:50Z
Default branch: main
Fork: no
Archived: no
README:
QCalEval
Evaluation scripts for the QCalEval benchmark — a dataset for assessing vision-language model capabilities on quantum calibration experiment analysis. Data is loaded directly from HuggingFace. Compatible with any OpenAI-compatible API endpoint.
Top 5 Models (Zero-Shot, April 2026)
Based on the QCalEval benchmark findings, we release NVIDIA Ising Calibration 1, an open-weight 35B MoE model fine-tuned for zero-shot quantum calibration plot understanding.

| Label | Question | Task | |-------|----------|------| | Tech. Desc. | Q1 | Structured JSON description of plot type, axes, and salient visual features | | Exp. Status | Q2 | 4-way outcome classification: expected behavior, suboptimal parameters, anomalous behavior, or apparatus issue | | Reasoning | Q3 | Experiment-specific scientific analysis: what the pattern implies, whether the sweep is sufficient, and what calibration step follows | | Fit Rel. | Q4 | Assess whether a visible fit is trustworthy for downstream use: reliable, unreliable, or no fit | | Param. Ext. | Q5 | Extract family-specific physical parameters into structured JSON | | Cal. Diag. | Q6 | Assign a family-specific status code (e.g., SUCCESS, NO_SIGNAL) with corrective action |
Zero-Shot Leaderboard (April 2026)
Scores are per-question averages (0–100), judged by GPT-5.4.
| Type | Model | Mean | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | |------|-------|-----:|---:|---:|---:|---:|---:|---:| | NVIDIA | Ising-Cal-1-35B | 74.7 | 87.8 | 67.1 | 64.7 | 90.5 | 62.5 | 75.3 | | Closed | Gemini-3.1-Pro | 72.3 | 88.5 | 57.2 | 61.1 | 84.4 | 71.5 | 71.2 | | Open | Gemma-4-31B-IT | 68.8 | 85.6 | 54.3 | 59.8 | 82.7 | 68.3 | 62.1 | | Closed | Gemini-3.1-Flash-Lite | 68.2 | 89.2 | 53.5 | 59.4 | 82.7 | 63.8 | 60.9 | | Closed | Claude Opus 4.6 | 67.8 | 90.8 | 49.0 | 65.5 | 76.1 | 64.7 | 60.5 | | Closed | Claude Sonnet 4.6 | 66.5 | 89.7 | 48.6 | 63.4 | 76.5 | 60.4 | 60.1 | | Closed | GPT-5.4 | 64.6 | 90.9 | 52.7 | 63.7 | 54.7 | 64.3 | 61.3 | | Open | Qwen3.5-397B-A17B | 58.6 | 88.1 | 42.8 | 52.0 | 50.6 | 62.5 | 55.6 | | Open | Qwen3.5-27B | 58.5 | 87.0 | 45.7 | 48.3 | 56.4 | 58.7 | 55.1 | | Open | Qwen3.5-122B-A10B | 57.1 | 86.6 | 44.0 | 49.0 | 50.2 | 61.2 | 51.9 | | Closed | GPT-5.4-Mini | 55.7 | 90.3 | 39.5 | 48.3 | 42.0 | 62.6 | 51.4 | | Open | Qwen3.5-35B-A3B | 55.5 | 86.8 | 39.9 | 45.7 | 52.7 | 57.8 | 50.6 | | Open | Qwen3.5-9B | 53.0 | 81.5 | 37.9 | 39.5 | 49.8 | 57.1 | 52.3 | | Closed | Claude Haiku 4.5 | 50.5 | 83.4 | 36.6 | 40.8 | 48.6 | 51.0 | 42.8 | | Open | InternVL3-78B | 48.2 | 76.3 | 37.0 | 34.1 | 42.8 | 52.9 | 45.7 | | Open | MiniCPM-o-4.5 | 44.5 | 76.7 | 31.7 | 29.8 | 32.5 | 47.9 | 48.1 | | Open | InternVL3-38B | 44.1 | 79.2 | 34.6 | 27.6 | 33.7 | 49.2 | 40.3 | | Open | Kimi-VL-A3B | 38.9 | 65.0 | 34.6 | 22.1 | 35.0 | 38.9 | 37.4 |
In-Context Learning (ICL) Leaderboard (April 2026)
In ICL mode, the model receives labeled demonstration examples from the same experiment family before the query plot — showing what a correct answer looks like for similar data. Q3 and Q6 use N-way demonstrations (multiple examples from the family), while Q5 uses a single 1-shot demonstration with the extraction schema. Delta shows change from zero-shot.
| Type | Model | Mean | Q3 | Delta | Q5 | Delta | Q6 | Delta | |------|-------|-----:|----|------:|----|------:|----|------:| | Closed | Gemini-3.1-Pro | 85.2 | 81.3 | +20.2 | 84.5 | +13.0 | 89.8 | +18.6 | | Closed | Claude Opus 4.6 | 85.1 | 84.7 | +19.2 | 81.3 | +16.6 | 89.4 | +28.9 | | Open | Gemma-4-31B-IT | 81.2 | 80.6 | +20.8 | 76.9 | +8.6 | 86.0 | +23.9 | | Closed | GPT-5.4 | 78.4 | 81.0 | +17.3 | 72.9 | +8.6 | 81.4 | +20.1 | | Closed | Gemini-3.1-Flash-Lite | 78.1 | 78.5 | +19.1 | 73.6 | +9.8 | 82.2 | +21.3 | | Closed | Claude Sonnet 4.6 | 75.9 | 77.8 | +14.4 | 71.9 | +11.5 | 78.0 | +17.9 | | Closed | GPT-5.4-Mini | 66.1 | 58.8 | +10.5 | 72.7 | +10.1 | 66.9 | +15.5 | | Closed | Claude Haiku 4.5 | 66.0 | 66.1 | +25.3 | 58.7 | +7.7 | 73.1 | +30.3 | | Open | InternVL3-38B | 56.9 | 56.2 | +28.6 | 59.5 | +10.3 | 55.1 | +14.8 | | Open | Qwen3.5-27B | 53.0 | 41.8 | -6.5 | 71.5 | +12.8 | 45.8 | -9.3 | | Open | Qwen3.5-397B-A17B | 48.0 | 37.4 | -14.6 | 64.3 | +1.8 | 42.4 | -13.2 | | Open | InternVL3-78B | 47.0 | 50.5 | +16.4 | 46.2 | -6.7 | 44.3 | -1.4 | | Open | Qwen3.5-122B-A10B | 44.6 | 36.1 | -12.9 | 62.5 | +1.3 | 35.2 | -16.7 | | Open | Qwen3.5-35B-A3B | 43.9 | 33.4 | -12.3 | 64.4 | +6.6 | 33.9 | -16.7 | | Open | Qwen3.5-9B | 43.2 | 32.8 | -6.7 | 63.0 | +5.9 | 33.9 | -18.4 | | Open | Kimi-VL-A3B | 40.6 | 34.9 | +12.8 | 54.3 | +15.4 | 32.6 | -4.8 | | Open | MiniCPM-o-4.5 | 33.0 | 19.3 | -10.5 | 50.5 | +2.6 | 29.2 | -18.9 |
Setup
pip install -r requirements.txt
Or with Nix (flakes enabled):
nix develop # dev shell with all dependencies nix run .#zeroshot -- --help # also: .#icl, .#judge
Scripts
benchmark_zeroshot.py — Zero-shot evaluation
Sends each image + question independently (6 requests per entry).
# OpenAI API python benchmark_zeroshot.py \ --api-base https://api.openai.com/v1/chat/completions \ --model-id gpt-5.4 \ --api-key-env OPENAI_API_KEY \ --output results_zeroshot.json # Local vLLM / NIM endpoint python benchmark_zeroshot.py \ --api-base http://localhost:8000/v1/chat/completions \ --model-id my-model \ --api-key dummy \ --output results_zeroshot.json # With options python benchmark_zeroshot.py \ --api-base https://api.openai.com/v1/chat/completions \ --model-id gpt-5.4 \ --concurrency 128 \ --limit 10 \ --output results.json
benchmark_icl.py — In-context learning (ICL) evaluation
Runs 3 questions per entry (Q3, Q5, Q6) with in-context demonstration examples.
python benchmark_icl.py \ --api-base https://api.openai.com/v1/chat/completions \ --model-id gpt-5.4 \ --api-key-env OPENAI_API_KEY \ --output results_icl.json # For models that use…
Excerpt shown — open the source for the full document.
Notability
notability 2.0/10Low stars, routine repo
NVIDIA has a repo signal matching data demand, evals and quality.