NVIDIA/TensorRT-LLM v1.3.0rc17
NVIDIA/TensorRT-LLM
Captured source
source ↗published Jun 2, 2026seen 5dcaptured 10hhttp 200method plain
v1.3.0rc17
Repository: NVIDIA/TensorRT-LLM
Tag: v1.3.0rc17
Published: 2026-06-02T18:50:51Z
Prerelease: yes
Release notes:
Highlights
- Known Issues
- DeepSeek V3.2 will crash with an illegal memory access during long-running performance tests under various agg/disagg configurations.
- Model Support
- Add MoT World Model support (#14012)
- Enable multi-node tensor parallelism for MiniMax-M2 (#14314)
- Restore Mistral Large 3 text-only processor (#14248)
- Support Gemma4 multi-head_dim pools and host-side slicing for SWA Triton kernels (#13745)
- Add a reasoning parser for Qwen3.5 (#14659)
- Add LTX-2 Ulysses cross-attention for v2a with audio padding (#14044)
- Add Poolside Laguna tool parser (#14638)
- Replace Parakeet audio encoder with native TensorRT-LLM layers (#14474)
- Set Mamba SSM cache to fp32 for NemotronV2 (#14448)
- API
- Allow
content: nullinCustomChatCompletionMessageParam(#14368) - Enforce
trust_remote_codeflag (#13527) - Add thinking token budget control (#14665)
- Expose host/GPU per-iter time and clarify iter labeling in
/metrics(#14127) - Make attention backend case-insensitive (#14635)
- Feature
- Add FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron (#13773)
- Integrate the FlashInfer GDN prefill kernel for Qwen3.5 (#13644)
- Add LoRA support to LLMAPI Triton backend (#14079)
- Log KV cache utilization and context tokens per iteration (#14206)
- Remove one-warp-per-token policy from MoE A2A kernels (#14550)
- Support non-divisible expert parallelism in MoE all-to-all and Slurm benchmark (#13888)
- Add CuTe DSL attention via exported binaries in VisualGen (#13721)
- Enable NVFP4 KV cache support in trtllm-gen attention (#12544)
- Add GMS-only weight sharing support (#13926)
- Add VisualGen tensor parallelism support (#13614)
- Enable NCCL symmetric zero-copy by default (#14472)
- Improve disaggregated TTFT (#14719)
- Fix
- Restore K2.5 multimodal dep8 accuracy test on Transformers 5.5.x (#14392)
- Remove sync after FlashInfer attention
plan()(#14634) - Add a compatibility shim in
load_hf_tokenizerforbytes_to_unicode(#14090) - Route
trtllm-benchandtrtllm-servetokenizer load throughTransformersTokenizer(#14452) - Fix crash in
deep_ep.pyby falling back to the pre-quant dispatch path whenhidden_states_sfis missing (#14404) - Fix gpt-oss accuracy issue by moving TinyGEMM PDL release after reduction (#14537)
- Fix Mistral-Large-3 weight loading crash (#14033)
- Bypass FlashInfer SSD prefill to fix state dtype precision (#14600)
- Fix qwen3 hang on SM120/121 (#14424)
- Fix NVFP4 engine size estimation and attention DP batch size in
trtllm-bench(#13498) - Catch
OSErrorinconfig_file_lockfor NFS compatibility (#11960) - Fix MoE DeepGEMM workspace size with attention DP (#13310)
- Fix inf/NaN issues in Triton Mamba softplus (#14652)
- Cap per-rank
max_num_active_requestsbymax_num_tokensunder attention DP (#14481) - Propagate external SWA window to FMHA kernel in V2 KV cache (#13719)
- Resolve NVML device index mismatch in
get_numa_aware_cpu_affinitywhenCUDA_VISIBLE_DEVICESis set (#12985) - Replace fixed disagg fill throttle with slow-start ramp (#14475)
- Reuse
batch_indices_cudaacross CUDA graph captures in EAGLE3 (#14381) - Make FA4 a proper pip dependency (#13788)
- Fix GSM8K accuracy tests for LagunaXS on B200/GB200/B300 (#14580)
- Documentation
- Add CUTLASS DSL uninstall step to installation guide (#14621)
- Add deprecation notice to legacy support-matrix.md (#14495)
- Fix incorrect auto sampler behavior description for beam search (#14487)
- Add VisualGen context to AGENTS.md (#14732)
- Test & Infra
- Update flashinfer-python from 0.6.11.post1 to 0.6.12rc2 (#14512, #14607)
- Add disagg local one-step run script for CI submit (#14557)
- Update model path definitions in
test_perf.pyand clean upwaives.txt(#14393) - Dedup executor unit tests on H100/B200 (#14556)
- Add disagg cancellation stress-test harness skeleton (#14375)
- Add UCX TLS env in disagg-related tests (#14626)
- Replace ONNX spec with
onnx>=1.21.0inrequirements.txt(#14577) - Add test lists with multi-GPU tests to CI multi-GPU test trigger files (#14087)
- Add offline equivalence test for sharding IR (#13963)
- Enable
kv_cache_manager_v2test for A10 (#12885) - Remove two-model EAGLE3 spec-decoding tests (#14735)
- Add
TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENSin spec decoding perf test (#14438)
What's Changed
- [https://nvbugs/6182617][fix] Restore K2.5 multimodal dep8 accuracy test on transformers 5.5.x by @tianyuxbear in https://github.com/NVIDIA/TensorRT-LLM/pull/14392
- [None][feat] FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron … by @farazkh80 in https://github.com/NVIDIA/TensorRT-LLM/pull/13773
- [None][perf] Integrate the flashinfer gdn prefill kernel for qwen3.5 by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/13644
- [None][chore] Update flashinfer-python from 0.6.11.post1 to 0.6.12rc1 by @yihwang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14512
- [https://nvbugs/6162328][fix] Add a tiny compat shim in
load_hf_tokenizerthat, whenbytes_to_unicodeis m by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14090 - [https://nvbugs/6114610][test] unwaive disagg tests fixed by UCX_TLS setter by @xwang233 in https://github.com/NVIDIA/TensorRT-LLM/pull/14440
- [None][fix] Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/14452
- [https://nvbugs/6184914][test] Unwaive related tests by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/14523
- [https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14404
- [None][infra] Waive 2 failed cases for main in post-merge 2734 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14526
- [None][infra] Waive 1 failed cases for main in post-merge 2735 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14542
- [#11257][feat] Add LoRA support to llmapi triton backend by @karljang in https://github.com/NVIDIA/TensorRT-LLM/pull/14079
- [None][chore] Include layer_idx in MoE backend fallback warnings by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/13409
- [None][chore] Add disagg local one-step run script for CI submit by @fredricz-20070104 in…
Excerpt shown — open the source for the full document.
Notability
notability 7.0/10Major update to popular LLM inference library