RepoBasetenBasetenpublished Apr 1, 2026seen 5d

basetenlabs/qwen3-nvfp4-benchmark

Python

Open original ↗

Captured source

source ↗

basetenlabs/qwen3-nvfp4-benchmark

Description: Qwen3 NVFP4 quantization benchmarks vs Bonsai 1-bit Pareto frontier

Language: Python

Stars: 0

Forks: 0

Open issues: 0

Created: 2026-04-01T21:03:59Z

Pushed: 2026-04-01T21:04:06Z

Default branch: main

Fork: no

Archived: no

README:

Qwen3 NVFP4 Quantization Benchmark

NVIDIA NVFP4 (4-bit floating point) quantization of Qwen3 dense models, benchmarked against PrismML's Bonsai 1-bit Pareto frontier.

Results

All models evaluated on the same 6-benchmark suite with greedy decoding (temperature=0, enable_thinking=false, seed=42) on NVIDIA B200 GPUs using vLLM + EvalScope.

Full Comparison Table

┌───────────────────────────┬──────────┬────────┬──────┬───────┬──────┬────────┬──────┬──────┐
│ Model │ Size(GB) │ MMLU-R │ MuSR │ GSM8K │ HE+ │ IFEval │ BFCL │ AVG │
├───────────────────────────┼──────────┼────────┼──────┼───────┼──────┼────────┼──────┼──────┤
│ Qwen3-8B BF16 │ 16.4 │ 81.1 │ 56.5 │ 92.7 │ 82.3 │ 86.6 │ 67.8 │ 77.8 │
│ Qwen3-8B NVFP4 PTQ │ 6.4 │ 79.6 │ 55.8 │ 92.1 │ 73.2 │ 85.0 │ 65.5 │ 75.2 │
│ Bonsai 8B (1-bit) │ 1.15 │ 65.2 │ 48.9 │ 85.1 │ 71.3 │ 84.6 │ 54.8 │ 68.3 │
├───────────────────────────┼──────────┼────────┼──────┼───────┼──────┼────────┼──────┼──────┤
│ Qwen3-4B BF16 │ 8.0 │ 79.4 │ 56.7 │ 90.9 │ 73.2 │ 84.3 │ 65.8 │ 75.1 │
│ Qwen3-4B NVFP4 PTQ │ 2.6 │ 75.0 │ 55.0 │ 87.7 │ 70.1 │ 80.5 │ 62.2 │ 71.8 │
│ Qwen3-4B NVFP4 QAD 2K │ 2.6 │ 60.2 │ 40.0 │ 81.5 │ 57.3 │ 49.5 │ 55.2 │ 57.3 │
│ Qwen3-4B NVFP4 QAD 10K │ 2.6 │ — │ — │ — │ — │ — │ — │ — │
│ Bonsai 4B (1-bit) │ 0.57 │ 58.7 │ 38.5 │ 85.7 │ 72.6 │ 77.8 │ 46.2 │ 63.2 │
├───────────────────────────┼──────────┼────────┼──────┼───────┼──────┼────────┼──────┼──────┤
│ Qwen3-1.7B NVFP4 PTQ │ 1.4 │ 59.9 │ 48.3 │ 71.0 │ 15.9 │ 66.5 │ 50.0 │ 52.0 │
│ Qwen3-1.7B NVFP4 QAD 2K* │ 1.4 │ 57.7 │ 46.3 │ 68.8 │ 40.8 │ 57.5 │ 47.1 │ 53.0 │
│ Qwen3-1.7B NVFP4 QAD 10K │ 1.4 │ — │ — │ — │ — │ — │ — │ — │
└───────────────────────────┴──────────┴────────┴──────┴───────┴──────┴────────┴──────┴──────┘

* QAD 2K 1.7B was benchmarked with thinking mode ON (not comparable to other rows).
QAD 10K runs in progress — results pending.

Key Findings

1. NVFP4 PTQ retains 95%+ quality. Qwen3-4B PTQ scores 71.8 vs 75.1 BF16 (95.6% retention) at 2.6x compression. Qwen3-8B PTQ scores 75.2 vs 77.8 (96.7%).

2. QAD with insufficient training HURTS. QAD 2K (2000 steps, 8K total samples) dropped 4B from 71.8 (PTQ) to 57.3 — a catastrophic 14.5-point degradation. The training was too short.

3. Embedding layer dominates small model size. Qwen3's 152K vocabulary creates a 622 MB (1.7B) or 778 MB (4B) embedding that stays in BF16. This is 44% and 30% of the quantized file size respectively.

4. 1.7B models score low on HumanEval+. The 15.9% score is real — verified by inspecting generations. The model produces reasonable code attempts but fails edge cases.

Benchmark Suite

Matches PrismML's Bonsai whitepaper evaluation:

| # | Benchmark | Tests | Metric | |---|-----------|-------|--------| | 1 | MMLU-Redux | 57 subjects | Accuracy | | 2 | MuSR | 756 multistep reasoning questions | Accuracy | | 3 | GSM8K | 1,319 math problems | Exact match | | 4 | HumanEval+ | 164 code problems (Docker sandbox) | pass@1 | | 5 | IFEval | 541 instruction-following prompts | (prompt_strict + inst_strict) / 2 | | 6 | BFCLv3 | 13+ tool-calling subsets | Macro-average |

Final score = arithmetic mean of all 6.

Checkpoints

| Model | HuggingFace | Size | How Made | |-------|-------------|------|----------| | Qwen3-8B NVFP4 PTQ | nvidia/Qwen3-8B-NVFP4 | 6.4 GB | NVIDIA official PTQ | | Qwen3-4B NVFP4 PTQ | baseten/Qwen3-4B-NVFP4-PTQ | 2.6 GB | PTQ with modelopt NVFP4_DEFAULT_CFG | | Qwen3-1.7B NVFP4 PTQ | baseten/Qwen3-1.7B-NVFP4-PTQ | 1.4 GB | PTQ with modelopt NVFP4_DEFAULT_CFG | | Qwen3-4B NVFP4 QAD | baseten/Qwen3-4B-NVFP4-QAD | 2.6 GB | QAD 2K steps (undertrained) | | Qwen3-1.7B NVFP4 QAD | baseten/Qwen3-1.7B-NVFP4-QAD | 1.4 GB | QAD 2K steps (undertrained) |

Quantization Details

NVFP4 (E2M1):

  • 4 bits per weight (16 representable values per sign)
  • Block size 16: every 16 weights share an FP8 E4M3 scale factor
  • Effective: 4.5 bits/weight for quantized layers
  • Layers skipped: embed_tokens, lm_head, RMSNorm, MoE routers

Size breakdown (Qwen3-4B):

| Component | Params | Format | Size | % | |-----------|--------|--------|------|---| | embed_tokens | 389M | BF16 | 778 MB | 30% | | 36 layers x 7 linears | 3,279M | NVFP4 | 1,845 MB | 70% | | Total | 3,668M | | 2.63 GB | |

Quick Start

Run a benchmark

pip install "vllm>=0.16.0" evalscope==1.4.2 'evalscope[sandbox]'

python benchmark.py \
--model-path /path/to/exported_checkpoint \
--model-name my-model \
--gpu 0 \
--quantization modelopt \
--model-size-gb 2.6

Produce an NVFP4 PTQ checkpoint

cd /path/to/Model-Optimizer/examples/llm_ptq
CUDA_VISIBLE_DEVICES=0 python hf_ptq.py \
--pyt_ckpt_path Qwen/Qwen3-4B \
--qformat nvfp4 \
--calib_size 512 \
--batch_size 4 \
--export_path ./qwen3-4b-nvfp4-ptq

Produce an NVFP4 QAD checkpoint

cd /path/to/Model-Optimizer/examples/llm_qat
bash launch.sh \
--model Qwen/Qwen3-4B \
--teacher_model Qwen/Qwen3-4B \
--distill True \
--quant_cfg NVFP4_DEFAULT_CFG \
--output_dir ./qwen3-4b-qad \
--max_steps 10000 \
--lr 2e-5 \
--fsdp_transformer_layer_cls_to_wrap Qwen3DecoderLayer

# Export for serving
python export.py --pyt_ckpt_path ./qwen3-4b-qad --export_path ./qwen3-4b-qad-exported

Serve with vLLM

vllm serve ./qwen3-4b-nvfp4-ptq \
--quantization modelopt \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}'

Hardware

All experiments run on 8x NVIDIA B200 (192 GB HBM3e each). NVFP4 W4A4 runs natively on Blackwell SM100.

Method

PTQ (Post-Training Quantization)

Calibrate quantization scales on 512 samples, export to packed FP4. No training. Takes ~30 minutes.

QAD (Quantization-Aware Distillation)

1. Quantize student model to NVFP4 (same as PTQ calibration) 2. Train with KL divergence loss against frozen BF16 teacher 3. Uses HuggingFace QADTrainer from NVIDIA Model Optimizer 4. Dataset: nvidia/Daring-Anteater (~98K samples) 5. Export to packed FP4 for deployment

Known Issues

  • QAD 2K undertrained: 2000 steps with batch_size=4 = only 8K samples. NVIDIA's reference uses 100K+ samples. Insufficient training degrades below PTQ baseline.…

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

New benchmark for Qwen3 NVFP4.