RepoNVIDIANVIDIApublished Apr 13, 2026seen 5d

NVIDIA/QCalEval

Python

Open original ↗

Captured source

source ↗
published Apr 13, 2026seen 5dcaptured 11hhttp 200method plain

NVIDIA/QCalEval

Description: Evaluation scripts for the QCalEval benchmark — a dataset for assessing vision-language model capabilities on quantum calibration experiment analysis.

Language: Python

License: Apache-2.0

Stars: 19

Forks: 2

Open issues: 0

Created: 2026-04-13T02:03:22Z

Pushed: 2026-04-23T23:08:50Z

Default branch: main

Fork: no

Archived: no

README:

QCalEval

Evaluation scripts for the QCalEval benchmark — a dataset for assessing vision-language model capabilities on quantum calibration experiment analysis. Data is loaded directly from HuggingFace. Compatible with any OpenAI-compatible API endpoint.

Top 5 Models (Zero-Shot, April 2026)

Based on the QCalEval benchmark findings, we release NVIDIA Ising Calibration 1, an open-weight 35B MoE model fine-tuned for zero-shot quantum calibration plot understanding.

![QCalEval Zero-Shot Leaderboard — Top 5 Models](leaderboard_top5.svg)

| Label | Question | Task | |-------|----------|------| | Tech. Desc. | Q1 | Structured JSON description of plot type, axes, and salient visual features | | Exp. Status | Q2 | 4-way outcome classification: expected behavior, suboptimal parameters, anomalous behavior, or apparatus issue | | Reasoning | Q3 | Experiment-specific scientific analysis: what the pattern implies, whether the sweep is sufficient, and what calibration step follows | | Fit Rel. | Q4 | Assess whether a visible fit is trustworthy for downstream use: reliable, unreliable, or no fit | | Param. Ext. | Q5 | Extract family-specific physical parameters into structured JSON | | Cal. Diag. | Q6 | Assign a family-specific status code (e.g., SUCCESS, NO_SIGNAL) with corrective action |

Zero-Shot Leaderboard (April 2026)

Scores are per-question averages (0–100), judged by GPT-5.4.

| Type | Model | Mean | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | |------|-------|-----:|---:|---:|---:|---:|---:|---:| | NVIDIA | Ising-Cal-1-35B | 74.7 | 87.8 | 67.1 | 64.7 | 90.5 | 62.5 | 75.3 | | Closed | Gemini-3.1-Pro | 72.3 | 88.5 | 57.2 | 61.1 | 84.4 | 71.5 | 71.2 | | Open | Gemma-4-31B-IT | 68.8 | 85.6 | 54.3 | 59.8 | 82.7 | 68.3 | 62.1 | | Closed | Gemini-3.1-Flash-Lite | 68.2 | 89.2 | 53.5 | 59.4 | 82.7 | 63.8 | 60.9 | | Closed | Claude Opus 4.6 | 67.8 | 90.8 | 49.0 | 65.5 | 76.1 | 64.7 | 60.5 | | Closed | Claude Sonnet 4.6 | 66.5 | 89.7 | 48.6 | 63.4 | 76.5 | 60.4 | 60.1 | | Closed | GPT-5.4 | 64.6 | 90.9 | 52.7 | 63.7 | 54.7 | 64.3 | 61.3 | | Open | Qwen3.5-397B-A17B | 58.6 | 88.1 | 42.8 | 52.0 | 50.6 | 62.5 | 55.6 | | Open | Qwen3.5-27B | 58.5 | 87.0 | 45.7 | 48.3 | 56.4 | 58.7 | 55.1 | | Open | Qwen3.5-122B-A10B | 57.1 | 86.6 | 44.0 | 49.0 | 50.2 | 61.2 | 51.9 | | Closed | GPT-5.4-Mini | 55.7 | 90.3 | 39.5 | 48.3 | 42.0 | 62.6 | 51.4 | | Open | Qwen3.5-35B-A3B | 55.5 | 86.8 | 39.9 | 45.7 | 52.7 | 57.8 | 50.6 | | Open | Qwen3.5-9B | 53.0 | 81.5 | 37.9 | 39.5 | 49.8 | 57.1 | 52.3 | | Closed | Claude Haiku 4.5 | 50.5 | 83.4 | 36.6 | 40.8 | 48.6 | 51.0 | 42.8 | | Open | InternVL3-78B | 48.2 | 76.3 | 37.0 | 34.1 | 42.8 | 52.9 | 45.7 | | Open | MiniCPM-o-4.5 | 44.5 | 76.7 | 31.7 | 29.8 | 32.5 | 47.9 | 48.1 | | Open | InternVL3-38B | 44.1 | 79.2 | 34.6 | 27.6 | 33.7 | 49.2 | 40.3 | | Open | Kimi-VL-A3B | 38.9 | 65.0 | 34.6 | 22.1 | 35.0 | 38.9 | 37.4 |

In-Context Learning (ICL) Leaderboard (April 2026)

In ICL mode, the model receives labeled demonstration examples from the same experiment family before the query plot — showing what a correct answer looks like for similar data. Q3 and Q6 use N-way demonstrations (multiple examples from the family), while Q5 uses a single 1-shot demonstration with the extraction schema. Delta shows change from zero-shot.

| Type | Model | Mean | Q3 | Delta | Q5 | Delta | Q6 | Delta | |------|-------|-----:|----|------:|----|------:|----|------:| | Closed | Gemini-3.1-Pro | 85.2 | 81.3 | +20.2 | 84.5 | +13.0 | 89.8 | +18.6 | | Closed | Claude Opus 4.6 | 85.1 | 84.7 | +19.2 | 81.3 | +16.6 | 89.4 | +28.9 | | Open | Gemma-4-31B-IT | 81.2 | 80.6 | +20.8 | 76.9 | +8.6 | 86.0 | +23.9 | | Closed | GPT-5.4 | 78.4 | 81.0 | +17.3 | 72.9 | +8.6 | 81.4 | +20.1 | | Closed | Gemini-3.1-Flash-Lite | 78.1 | 78.5 | +19.1 | 73.6 | +9.8 | 82.2 | +21.3 | | Closed | Claude Sonnet 4.6 | 75.9 | 77.8 | +14.4 | 71.9 | +11.5 | 78.0 | +17.9 | | Closed | GPT-5.4-Mini | 66.1 | 58.8 | +10.5 | 72.7 | +10.1 | 66.9 | +15.5 | | Closed | Claude Haiku 4.5 | 66.0 | 66.1 | +25.3 | 58.7 | +7.7 | 73.1 | +30.3 | | Open | InternVL3-38B | 56.9 | 56.2 | +28.6 | 59.5 | +10.3 | 55.1 | +14.8 | | Open | Qwen3.5-27B | 53.0 | 41.8 | -6.5 | 71.5 | +12.8 | 45.8 | -9.3 | | Open | Qwen3.5-397B-A17B | 48.0 | 37.4 | -14.6 | 64.3 | +1.8 | 42.4 | -13.2 | | Open | InternVL3-78B | 47.0 | 50.5 | +16.4 | 46.2 | -6.7 | 44.3 | -1.4 | | Open | Qwen3.5-122B-A10B | 44.6 | 36.1 | -12.9 | 62.5 | +1.3 | 35.2 | -16.7 | | Open | Qwen3.5-35B-A3B | 43.9 | 33.4 | -12.3 | 64.4 | +6.6 | 33.9 | -16.7 | | Open | Qwen3.5-9B | 43.2 | 32.8 | -6.7 | 63.0 | +5.9 | 33.9 | -18.4 | | Open | Kimi-VL-A3B | 40.6 | 34.9 | +12.8 | 54.3 | +15.4 | 32.6 | -4.8 | | Open | MiniCPM-o-4.5 | 33.0 | 19.3 | -10.5 | 50.5 | +2.6 | 29.2 | -18.9 |

Setup

pip install -r requirements.txt

Or with Nix (flakes enabled):

nix develop # dev shell with all dependencies
nix run .#zeroshot -- --help # also: .#icl, .#judge

Scripts

benchmark_zeroshot.py — Zero-shot evaluation

Sends each image + question independently (6 requests per entry).

# OpenAI API
python benchmark_zeroshot.py \
--api-base https://api.openai.com/v1/chat/completions \
--model-id gpt-5.4 \
--api-key-env OPENAI_API_KEY \
--output results_zeroshot.json

# Local vLLM / NIM endpoint
python benchmark_zeroshot.py \
--api-base http://localhost:8000/v1/chat/completions \
--model-id my-model \
--api-key dummy \
--output results_zeroshot.json

# With options
python benchmark_zeroshot.py \
--api-base https://api.openai.com/v1/chat/completions \
--model-id gpt-5.4 \
--concurrency 128 \
--limit 10 \
--output results.json

benchmark_icl.py — In-context learning (ICL) evaluation

Runs 3 questions per entry (Q3, Q5, Q6) with in-context demonstration examples.

python benchmark_icl.py \
--api-base https://api.openai.com/v1/chat/completions \
--model-id gpt-5.4 \
--api-key-env OPENAI_API_KEY \
--output results_icl.json

# For models that use…

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Low stars, routine repo

NVIDIA has a repo signal matching data demand, evals and quality.