ReleaseNVIDIANVIDIApublished Jun 2, 2026seen 5d

NVIDIA/TensorRT-LLM v1.3.0rc17

NVIDIA/TensorRT-LLM

Open original ↗

Captured source

source ↗
published Jun 2, 2026seen 5dcaptured 10hhttp 200method plain

v1.3.0rc17

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc17

Published: 2026-06-02T18:50:51Z

Prerelease: yes

Release notes:

Highlights

  • Known Issues
  • DeepSeek V3.2 will crash with an illegal memory access during long-running performance tests under various agg/disagg configurations.
  • Model Support
  • Add MoT World Model support (#14012)
  • Enable multi-node tensor parallelism for MiniMax-M2 (#14314)
  • Restore Mistral Large 3 text-only processor (#14248)
  • Support Gemma4 multi-head_dim pools and host-side slicing for SWA Triton kernels (#13745)
  • Add a reasoning parser for Qwen3.5 (#14659)
  • Add LTX-2 Ulysses cross-attention for v2a with audio padding (#14044)
  • Add Poolside Laguna tool parser (#14638)
  • Replace Parakeet audio encoder with native TensorRT-LLM layers (#14474)
  • Set Mamba SSM cache to fp32 for NemotronV2 (#14448)
  • API
  • Allow content: null in CustomChatCompletionMessageParam (#14368)
  • Enforce trust_remote_code flag (#13527)
  • Add thinking token budget control (#14665)
  • Expose host/GPU per-iter time and clarify iter labeling in /metrics (#14127)
  • Make attention backend case-insensitive (#14635)
  • Feature
  • Add FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron (#13773)
  • Integrate the FlashInfer GDN prefill kernel for Qwen3.5 (#13644)
  • Add LoRA support to LLMAPI Triton backend (#14079)
  • Log KV cache utilization and context tokens per iteration (#14206)
  • Remove one-warp-per-token policy from MoE A2A kernels (#14550)
  • Support non-divisible expert parallelism in MoE all-to-all and Slurm benchmark (#13888)
  • Add CuTe DSL attention via exported binaries in VisualGen (#13721)
  • Enable NVFP4 KV cache support in trtllm-gen attention (#12544)
  • Add GMS-only weight sharing support (#13926)
  • Add VisualGen tensor parallelism support (#13614)
  • Enable NCCL symmetric zero-copy by default (#14472)
  • Improve disaggregated TTFT (#14719)
  • Fix
  • Restore K2.5 multimodal dep8 accuracy test on Transformers 5.5.x (#14392)
  • Remove sync after FlashInfer attention plan() (#14634)
  • Add a compatibility shim in load_hf_tokenizer for bytes_to_unicode (#14090)
  • Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer (#14452)
  • Fix crash in deep_ep.pyby falling back to the pre-quant dispatch path when hidden_states_sf is missing (#14404)
  • Fix gpt-oss accuracy issue by moving TinyGEMM PDL release after reduction (#14537)
  • Fix Mistral-Large-3 weight loading crash (#14033)
  • Bypass FlashInfer SSD prefill to fix state dtype precision (#14600)
  • Fix qwen3 hang on SM120/121 (#14424)
  • Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench (#13498)
  • Catch OSError in config_file_lock for NFS compatibility (#11960)
  • Fix MoE DeepGEMM workspace size with attention DP (#13310)
  • Fix inf/NaN issues in Triton Mamba softplus (#14652)
  • Cap per-rank max_num_active_requests by max_num_tokens under attention DP (#14481)
  • Propagate external SWA window to FMHA kernel in V2 KV cache (#13719)
  • Resolve NVML device index mismatch in get_numa_aware_cpu_affinity when CUDA_VISIBLE_DEVICES is set (#12985)
  • Replace fixed disagg fill throttle with slow-start ramp (#14475)
  • Reuse batch_indices_cuda across CUDA graph captures in EAGLE3 (#14381)
  • Make FA4 a proper pip dependency (#13788)
  • Fix GSM8K accuracy tests for LagunaXS on B200/GB200/B300 (#14580)
  • Documentation
  • Add CUTLASS DSL uninstall step to installation guide (#14621)
  • Add deprecation notice to legacy support-matrix.md (#14495)
  • Fix incorrect auto sampler behavior description for beam search (#14487)
  • Add VisualGen context to AGENTS.md (#14732)
  • Test & Infra
  • Update flashinfer-python from 0.6.11.post1 to 0.6.12rc2 (#14512, #14607)
  • Add disagg local one-step run script for CI submit (#14557)
  • Update model path definitions in test_perf.py and clean up waives.txt (#14393)
  • Dedup executor unit tests on H100/B200 (#14556)
  • Add disagg cancellation stress-test harness skeleton (#14375)
  • Add UCX TLS env in disagg-related tests (#14626)
  • Replace ONNX spec with onnx>=1.21.0 in requirements.txt (#14577)
  • Add test lists with multi-GPU tests to CI multi-GPU test trigger files (#14087)
  • Add offline equivalence test for sharding IR (#13963)
  • Enable kv_cache_manager_v2 test for A10 (#12885)
  • Remove two-model EAGLE3 spec-decoding tests (#14735)
  • Add TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS in spec decoding perf test (#14438)

What's Changed

  • [https://nvbugs/6182617][fix] Restore K2.5 multimodal dep8 accuracy test on transformers 5.5.x by @tianyuxbear in https://github.com/NVIDIA/TensorRT-LLM/pull/14392
  • [None][feat] FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron … by @farazkh80 in https://github.com/NVIDIA/TensorRT-LLM/pull/13773
  • [None][perf] Integrate the flashinfer gdn prefill kernel for qwen3.5 by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/13644
  • [None][chore] Update flashinfer-python from 0.6.11.post1 to 0.6.12rc1 by @yihwang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14512
  • [https://nvbugs/6162328][fix] Add a tiny compat shim in load_hf_tokenizer that, when bytes_to_unicode is m by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14090
  • [https://nvbugs/6114610][test] unwaive disagg tests fixed by UCX_TLS setter by @xwang233 in https://github.com/NVIDIA/TensorRT-LLM/pull/14440
  • [None][fix] Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/14452
  • [https://nvbugs/6184914][test] Unwaive related tests by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/14523
  • [https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14404
  • [None][infra] Waive 2 failed cases for main in post-merge 2734 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14526
  • [None][infra] Waive 1 failed cases for main in post-merge 2735 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14542
  • [#11257][feat] Add LoRA support to llmapi triton backend by @karljang in https://github.com/NVIDIA/TensorRT-LLM/pull/14079
  • [None][chore] Include layer_idx in MoE backend fallback warnings by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/13409
  • [None][chore] Add disagg local one-step run script for CI submit by @fredricz-20070104 in…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Major update to popular LLM inference library