NVIDIA/TensorRT-LLM v1.3.0rc19
Repository: NVIDIA/TensorRT-LLM
Tag: v1.3.0rc19
Published: 2026-06-23T16:49:25Z
Prerelease: yes
Release notes:
- Known Issues
- Llama 3.1 8B FP8 can hang during the autotuner warmup on GB200.
- Model Support
- Support NVIDIA Wan2.2-T2V quantized checkpoints (#15093)
- Enable MTP for Step-3.7 NVFP4 and port Step-3.7VL vision tower to TRT-LLM modules (#14926)
- Support T5 and BART in the PyTorch backend (#13919)
- Support MiniMax-M3 in the PyTorch backend (#15292)
- API
- Align VisualGen serve request schema with `VisualGenParams` (#14733)
- Support multi-item scoring in `LLM.encode` (#14693)
- Drop legacy `--extra_visual_gen_options` CLI alias (#15262)
- Feature
- Enable TRTLLM MoE backend for Nemotron-H BF16 checkpoint (#14944)
- Add async Ulysses pipeline (enabled for LTX-2 and WAN) (#13978)
- Make `TrtllmGenAttention` the default decode backend on Blackwell+ (#14618)
- Skip redundant data expand in `DeepGemmFusedMoE` via fused expand+quant Triton kernel (#14591)
- Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy (#12636)
- Add Indexer TopK single-block / multi-pass radix implementation (#14268)
- Enable gen-only speculative decoding for disagg setups (#14546)
- Support EAGLE3 dynamic trees on Blackwell (#12958)
- Add CUDA graph support for per-expert LoRA in Cutlass backend (#14881)
- Add support for beam search in disaggregated serving (#14876)
- Add maximal LLMAPI capture in usage telemetry (#14398)
- Optimize Qwen2.5/3/3.5-VL performance (#11943)
- Add skip-softmax TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 (#15163)
- Enable TRTLLM cross attention backend (#15345)
- Support per-request `mm_processor_kwargs` for Qwen3-VL (#14702)
- Add `prefetch_reuse_blocks` and configurable prefetch count (#15149)
- Add MegaMoECuteDsl NVFP4 MoE backend (#14608)
- Make EAGLE3 honor sampling params by default (#14745)
- Add multiple FMHA library support to TRTLLM attention backend (#15204)
- Add checkpointing variant of replay for MTP for mamba models (#14203)
- Fix
- Remove redundant `TikTokenTokenizer` shim from Kimi-K2.5 input processor (#14741)
- Rename misnamed `tunable_fp4_quantize` kwarg and add real SF-swizzle control (#15002)
- Gate FlashInfer GDN kernels to supported configurations (#15094)
- Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate (#15088)
- Select CUTLASS MoE backend on non-Blackwell SMs for Qwen3.5-35B-A3B FP8 (#15081)
- Fix SageAttention kernel regression by using static scheduler (#15047)
- Fall back to local cache when loading tokenizer for gated models (#12998)
- Fix PyExecutor FPM iteration timing (#14922)
- Register multimodal placeholders for Qwen3.5 MoE VLM serving (#15079)
- Fix and unwaive Nemotron-related bugs (#15085)
- Guard DSA DSL atom-split against MTP draft next (#14891)
- Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD (#15136)
- Clear workspace in `run_mla_generation` to avoid illegal memory access (#15173)
- Fix `MAX_UTILIZATION` reuse token budget (#15066)
- Add `kv_transfer_timeout_ms` to avoid timeout (#15152)
- Preserve ip:port for `trtllm-serve` visual-gen (#14355)
- Fix guided decoding (xgrammar) + EAGLE-3 + `draft_len_schedule` crash during CUDA graph capture (#15023)
- Stabilize Mamba replay state update (#14841)
- Fix `max_context_length` value for attention workspace sizing (#15156)
- Fix issue where host KV cache usage would double when speculative decoding is used (#14373)
- Disable `NCCL_SYMMETRIC` tactic on GB10 (DGX Spark) (#12902)
- Fix `attentionOp` FP8 MLA KV-reuse workspace calculation (#14852)
- Fix beam search `log_probs` non-determinism with `batch_size > 1` (#15125)
- Forward `secondary_offload_min_priority` to `KVCacheManager` in PyTorch executor (#13768)
- Enable multi-block mode for XQA HMMA spec-dec (#15312)
- Fix TinyGEMM barrier bug (#15338)
- Fix stale sparse attention kwargs (#15460)
- Fix `CppMambaHybridCacheManager` to handle dp dummy request (#15054)
- Fix embedding vocab mask for rejection sampling in Kimi-K2.5 (#15233)
- Documentation
- Add FLUX visual generation examples (#14987)
- Add Qwen3.5 deployment guide doc (#15111)
- Fix stale `--disable_xqa` reference in legacy docs (#13395)
- Add Cache-DiT documentation (#15268)
- Benchmark
- Weight trtllm-bench AR/AL averages by output length (#14998)
- Test & Infra
- Add accuracy tests for nemotron-v3-ultra (#14808)
- Remove `TestLlama4ScoutInstruct` tests (#15144)
- Require minimum of 4 GPUs in `llm_perf_core.yml` and add new performance tests (#15090)
- Add DFlash coverage for Qwen3.5 MoE variant (#15132)
- Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 (#15126)
- Enable disagg cancellation stress test (#15174)
- Fix periodic-junit in unittest pytest (#14075)
- Update K2.5 and GLM-5 into CI perf test (#14960)
- Add Qwen3-32B FP8 disagg stress test (#14278)
- Sunset old disagg test cases for the QA side (#15290)
- Add e2e Tensor Parallel LPIPS tests for VisualGen (#15208)
- Remove TensorRT performance baseline and update to PyTorch only (#15256)
- Add integration tests for MoE LoRA and bugfixes (#15271)
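
The Benchmark entry above (#14998) says trtllm-bench now weights its AR/AL averages by output length. As a rough illustration of what such weighting changes, the sketch below compares a plain mean with a length-weighted mean over per-request values. It is not the trtllm-bench implementation: the function names and sample numbers are hypothetical, and it assumes AL refers to a per-request speculative-decoding acceptance length.

```python
from typing import Sequence


def unweighted_mean(values: Sequence[float]) -> float:
    """Plain average: every request counts equally, regardless of its length."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def length_weighted_mean(values: Sequence[float],
                         output_lengths: Sequence[int]) -> float:
    """Average in which each request contributes in proportion to its output length."""
    if len(values) != len(output_lengths):
        raise ValueError("values and output_lengths must have the same length")
    total_tokens = sum(output_lengths)
    if total_tokens == 0:
        raise ValueError("total output length must be positive")
    weighted_sum = sum(v * n for v, n in zip(values, output_lengths))
    return weighted_sum / total_tokens


# Hypothetical per-request statistics (illustrative numbers, not measured data).
acceptance_lengths = [2.0, 4.0]   # assumed per-request AL values
output_lengths = [100, 300]       # generated tokens per request

plain = unweighted_mean(acceptance_lengths)
weighted = length_weighted_mean(acceptance_lengths, output_lengths)
print(f"unweighted: {plain}, length-weighted: {weighted}")
```

With these sample numbers the longer request pulls the weighted figure toward its own value, so a few short requests no longer skew the aggregate.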
What's Changed
- [None][infra] Waive TestQwen3NextInstruct nvfp4 cases by @mzweilz in https://github.com/NVIDIA/TensorRT-LLM/pull/15086
- [https://nvbugs/6248757][fix] Avoid running all reduce in aux stream by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14917
- [https://nvbugs/6221483][fix] AutoDeploy: Fix Eagle metadata host syncs by @govind-ramnarayan in https://github.com/NVIDIA/TensorRT-LLM/pull/14714
- [None][feat] add FLUX visual generation examples by @karljang in https://github.com/NVIDIA/TensorRT-LLM/pull/14987
- [https://nvbugs/6261164][fix] AutoDeploy: Don't allocate speculative caches when speculation is off by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/15020
- [https://nvbugs/6211189][fix] Lower the reference to 46.5 (matching cross-GPU empirical mean) and remove the t by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14799
- [None][refactor] split VisualGen pipeline and model configs by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/14956
- [TRTLLM-11457][feat] Async Ulysses pipeline (Enabled for LTX-2 + WAN) by @luyiyun1021 in https://github.com/NVIDIA/TensorRT-LLM/pull/13978
- [TRTLLM-11548][doc] Add Qwen3.5 deployment guide doc by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/15111
- [https://nvbugs/6181383][fix] Build...