Release: NVIDIA, published Jun 23, 2026

NVIDIA/TensorRT-LLM v1.3.0rc19

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc19

Published: 2026-06-23T16:49:25Z

Prerelease: yes

Release notes:

  • Known Issues
      • Llama 3.1 8B FP8 can hang during the autotuner warmup on GB200.
  • Model Support
      • Support NVIDIA Wan2.2-T2V quantized checkpoints (#15093)
      • Enable MTP for Step-3.7 NVFP4 and port Step-3.7VL vision tower to TRT-LLM modules (#14926)
      • Support T5 and BART in the PyTorch backend (#13919)
      • Support MiniMax-M3 in the PyTorch backend (#15292)
  • API
      • Align VisualGen serve request schema with VisualGenParams (#14733)
      • Support multi-item scoring in LLM.encode (#14693)
      • Drop legacy --extra_visual_gen_options CLI alias (#15262)
  • Feature
      • Enable TRTLLM MoE backend for Nemotron-H BF16 checkpoint (#14944)
      • Add async Ulysses pipeline (enabled for LTX-2 and WAN) (#13978)
      • Make TrtllmGenAttention the default decode backend on Blackwell+ (#14618)
      • Skip redundant data expand in DeepGemmFusedMoE via fused expand+quant Triton kernel (#14591)
      • Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy (#12636)
      • Add Indexer TopK single-block / multi-pass radix implementation (#14268)
      • Enable gen-only speculative decoding for disagg setups (#14546)
      • Support EAGLE3 dynamic trees on Blackwell (#12958)
      • Add CUDA graph support for per-expert LoRA in Cutlass backend (#14881)
      • Add support for beam search in disaggregated serving (#14876)
      • Add maximal LLMAPI capture in usage telemetry (#14398)
      • Optimize Qwen2.5/3/3.5-VL performance (#11943)
      • Add skip-softmax TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 (#15163)
      • Enable TRTLLM cross attention backend (#15345)
      • Support per-request mm_processor_kwargs for Qwen3-VL (#14702)
      • Add prefetch_reuse_blocks and configurable prefetch count (#15149)
      • Add MegaMoECuteDsl NVFP4 MoE backend (#14608)
      • Make EAGLE3 honor sampling params by default (#14745)
      • Add multiple FMHA library support to TRTLLM attention backend (#15204)
      • Add checkpointing variant of replay for MTP for mamba models (#14203)
  • Fix
      • Remove redundant TikTokenTokenizer shim from Kimi-K2.5 input processor (#14741)
      • Rename misnamed tunable_fp4_quantize kwarg and add real SF-swizzle control (#15002)
      • Gate FlashInfer GDN kernels to supported configurations (#15094)
      • Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate (#15088)
      • Select CUTLASS MoE backend on non-Blackwell SMs for Qwen3.5-35B-A3B FP8 (#15081)
      • Fix SageAttention kernel regression by using static scheduler (#15047)
      • Fall back to local cache when loading tokenizer for gated models (#12998)
      • Fix PyExecutor FPM iteration timing (#14922)
      • Register multimodal placeholders for Qwen3.5 MoE VLM serving (#15079)
      • Fix and unwaive Nemotron-related bugs (#15085)
      • Guard DSA DSL atom-split against MTP draft next (#14891)
      • Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD (#15136)
      • Clear workspace in run_mla_generation to avoid illegal memory access (#15173)
      • Fix MAX_UTILIZATION reuse token budget (#15066)
      • Add kv_transfer_timeout_ms to avoid timeout (#15152)
      • Preserve ip:port for trtllm-serve visual-gen (#14355)
      • Fix guided decoding (xgrammar) + EAGLE-3 + draft_len_schedule crash during CUDA graph capture (#15023)
      • Stabilize Mamba replay state update (#14841)
      • Fix max_context_length value for attention workspace sizing (#15156)
      • Fix issue where host KV cache usage would double when speculative decoding is used (#14373)
      • Disable NCCL_SYMMETRIC tactic on GB10 (DGX Spark) (#12902)
      • Fix attentionOp FP8 MLA KV-reuse workspace calculation (#14852)
      • Fix beam search log_probs non-determinism with batch_size > 1 (#15125)
      • Forward secondary_offload_min_priority to KVCacheManager in PyTorch executor (#13768)
      • Enable multi-block mode for XQA HMMA spec-dec (#15312)
      • Fix TinyGEMM barrier bug (#15338)
      • Fix stale sparse attention kwargs (#15460)
      • Fix CppMambaHybridCacheManager to handle dp dummy request (#15054)
      • Fix embedding vocab mask for rejection sampling in Kimi-K2.5 (#15233)
  • Documentation
      • Add FLUX visual generation examples (#14987)
      • Add Qwen3.5 deployment guide doc (#15111)
      • Fix stale --disable_xqa reference in legacy docs (#13395)
      • Add Cache-DiT documentation (#15268)
  • Benchmark
      • Weight trtllm-bench AR/AL averages by output length (#14998)
  • Test & Infra
      • Add accuracy tests for nemotron-v3-ultra (#14808)
      • Remove TestLlama4ScoutInstruct tests (#15144)
      • Require minimum of 4 GPUs in llm_perf_core.yml and add new performance tests (#15090)
      • Add DFlash coverage for Qwen3.5 MoE variant (#15132)
      • Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 (#15126)
      • Enable disagg cancellation stress test (#15174)
      • Fix periodic-junit in unittest pytest (#14075)
      • Update K2.5 and GLM-5 into CI perf test (#14960)
      • Add Qwen3-32B FP8 disagg stress test (#14278)
      • Sunset old disagg test cases for the QA side (#15290)
      • Add e2e Tensor Parallel LPIPS tests for VisualGen (#15208)
      • Remove TensorRT performance baseline and update to PyTorch only (#15256)
      • Add integration tests for MoE LoRA and bugfixes (#15271)
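
The Benchmark entry above (#14998) changes how trtllm-bench averages AR/AL. As a rough illustration of what weighting by output length means, here is a minimal sketch. It assumes AL is a per-request acceptance length, and the sample numbers and the `weighted_mean` helper are hypothetical; none of this is taken from the trtllm-bench source.

```python
def weighted_mean(values, weights):
    """Return the mean of `values`, weighting each entry by `weights`."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical per-request numbers (not real benchmark output).
acceptance_lengths = [2.0, 4.0]   # one value per request
output_lengths = [100, 300]       # generated tokens per request

# Unweighted: every request counts equally.
plain_mean = sum(acceptance_lengths) / len(acceptance_lengths)

# Weighted: long generations contribute proportionally more.
length_weighted_mean = weighted_mean(acceptance_lengths, output_lengths)

print(plain_mean, length_weighted_mean)
```

With these inputs the unweighted mean is 3.0 and the length-weighted mean is 3.5, because the request that generated 300 tokens counts three times as much as the one that generated 100.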

What's Changed

  • [None][infra] Waive TestQwen3NextInstruct nvfp4 cases by @mzweilz in https://github.com/NVIDIA/TensorRT-LLM/pull/15086
  • [https://nvbugs/6248757][fix] Avoid running all reduce in aux stream by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14917
  • [https://nvbugs/6221483][fix] AutoDeploy: Fix Eagle metadata host syncs by @govind-ramnarayan in https://github.com/NVIDIA/TensorRT-LLM/pull/14714
  • [None][feat] add FLUX visual generation examples by @karljang in https://github.com/NVIDIA/TensorRT-LLM/pull/14987
  • [https://nvbugs/6261164][fix] AutoDeploy: Don't allocate speculative caches when speculation is off by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/15020
  • [https://nvbugs/6211189][fix] Lower the reference to 46.5 (matching cross-GPU empirical mean) and remove the t by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14799
  • [None][refactor] split VisualGen pipeline and model configs by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/14956
  • [TRTLLM-11457][feat] Async Ulysses pipeline (Enabled for LTX-2 + WAN) by @luyiyun1021 in https://github.com/NVIDIA/TensorRT-LLM/pull/13978
  • [TRTLLM-11548][doc] Add Qwen3.5 deployment guide doc by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/15111
  • [https://nvbugs/6181383][fix] Build...
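
The "What's Changed" entries above follow a visible `[ticket][type] title by @author in URL` pattern. For readers who want to filter or group them, the following is a minimal parsing sketch. The regular expression and the `parse_entry` helper are hypothetical and inferred only from the lines shown here, so truncated or differently shaped entries simply return `None`.

```python
import re

# Pattern inferred from the entries listed above; not an official format.
ENTRY_RE = re.compile(
    r"^\[(?P<ticket>[^\]]+)\]"      # e.g. None, TRTLLM-11457, https://nvbugs/...
    r"\[(?P<kind>[^\]]+)\]\s+"      # e.g. feat, fix, infra, doc, refactor
    r"(?P<title>.+?)\s+by\s+"       # free-form PR title
    r"@(?P<author>[\w-]+)\s+in\s+"  # GitHub handle
    r"(?P<url>https://\S+)$"        # pull request URL
)


def parse_entry(line):
    """Parse one changelog line into a dict, or return None if it does not match."""
    match = ENTRY_RE.match(line.strip().lstrip("•").strip())
    if match is None:
        return None
    entry = match.groupdict()
    # The PR number is the last path segment of the URL.
    entry["pr"] = int(entry["url"].rsplit("/", 1)[1])
    return entry


sample = (
    "  • [None][feat] add FLUX visual generation examples by @karljang "
    "in https://github.com/NVIDIA/TensorRT-LLM/pull/14987"
)
parsed = parse_entry(sample)
print(parsed)
```

A truncated entry such as the final `Build...` line has no author or URL, so `parse_entry` returns `None` for it rather than guessing the missing fields.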
