What does this release signal mean?

NVIDIA published NVIDIA/TensorRT-LLM v1.3.0rc17 (NVIDIA/TensorRT-LLM). This release signal is evidence of what shipped, changed, or was packaged for users. High-signal details: Library for optimizing LLM inference on NVIDIA GPUs. · v1.3.0rc17 Repository: NVIDIA/TensorRT-LLM Tag: v1.3.0rc17 Published: 2026-06-02T18:50:51Z Prerelease: yes Release notes: Highlights - Known Issues - DeepSeek V3.2 will.... onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/TensorRT-LLM v1.3.0rc17

Captured source

source ↗

GitHub/github.com/NVIDIA/TensorRT-LLM

NVIDIA/TensorRT-LLM v1.3.0rc17

Source ↗

published Jun 2, 2026seen Jun 6captured Jun 11http 200method plain

v1.3.0rc17

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc17

Published: 2026-06-02T18:50:51Z

Prerelease: yes

Release notes:

Highlights

Known Issues
DeepSeek V3.2 will crash with an illegal memory access during long-running performance tests under various agg/disagg configurations.
Model Support
Add MoT World Model support (#14012)
Enable multi-node tensor parallelism for MiniMax-M2 (#14314)
Restore Mistral Large 3 text-only processor (#14248)
Support Gemma4 multi-head_dim pools and host-side slicing for SWA Triton kernels (#13745)
Add a reasoning parser for Qwen3.5 (#14659)
Add LTX-2 Ulysses cross-attention for v2a with audio padding (#14044)
Add Poolside Laguna tool parser (#14638)
Replace Parakeet audio encoder with native TensorRT-LLM layers (#14474)
Set Mamba SSM cache to fp32 for NemotronV2 (#14448)
API
Allow content: null in CustomChatCompletionMessageParam (#14368)
Enforce trust_remote_code flag (#13527)
Add thinking token budget control (#14665)
Expose host/GPU per-iter time and clarify iter labeling in /metrics (#14127)
Make attention backend case-insensitive (#14635)
Feature
Add FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron (#13773)
Integrate the FlashInfer GDN prefill kernel for Qwen3.5 (#13644)
Add LoRA support to LLMAPI Triton backend (#14079)
Log KV cache utilization and context tokens per iteration (#14206)
Remove one-warp-per-token policy from MoE A2A kernels (#14550)
Support non-divisible expert parallelism in MoE all-to-all and Slurm benchmark (#13888)
Add CuTe DSL attention via exported binaries in VisualGen (#13721)
Enable NVFP4 KV cache support in trtllm-gen attention (#12544)
Add GMS-only weight sharing support (#13926)
Add VisualGen tensor parallelism support (#13614)
Enable NCCL symmetric zero-copy by default (#14472)
Improve disaggregated TTFT (#14719)
Fix
Restore K2.5 multimodal dep8 accuracy test on Transformers 5.5.x (#14392)
Remove sync after FlashInfer attention plan() (#14634)
Add a compatibility shim in load_hf_tokenizer for bytes_to_unicode (#14090)
Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer (#14452)
Fix crash in deep_ep.pyby falling back to the pre-quant dispatch path when hidden_states_sf is missing (#14404)
Fix gpt-oss accuracy issue by moving TinyGEMM PDL release after reduction (#14537)
Fix Mistral-Large-3 weight loading crash (#14033)
Bypass FlashInfer SSD prefill to fix state dtype precision (#14600)
Fix qwen3 hang on SM120/121 (#14424)
Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench (#13498)
Catch OSError in config_file_lock for NFS compatibility (#11960)
Fix MoE DeepGEMM workspace size with attention DP (#13310)
Fix inf/NaN issues in Triton Mamba softplus (#14652)
Cap per-rank max_num_active_requests by max_num_tokens under attention DP (#14481)
Propagate external SWA window to FMHA kernel in V2 KV cache (#13719)
Resolve NVML device index mismatch in get_numa_aware_cpu_affinity when CUDA_VISIBLE_DEVICES is set (#12985)
Replace fixed disagg fill throttle with slow-start ramp (#14475)
Reuse batch_indices_cuda across CUDA graph captures in EAGLE3 (#14381)
Make FA4 a proper pip dependency (#13788)
Fix GSM8K accuracy tests for LagunaXS on B200/GB200/B300 (#14580)
Documentation
Add CUTLASS DSL uninstall step to installation guide (#14621)
Add deprecation notice to legacy support-matrix.md (#14495)
Fix incorrect auto sampler behavior description for beam search (#14487)
Add VisualGen context to AGENTS.md (#14732)
Test & Infra
Update flashinfer-python from 0.6.11.post1 to 0.6.12rc2 (#14512, #14607)
Add disagg local one-step run script for CI submit (#14557)
Update model path definitions in test_perf.py and clean up waives.txt (#14393)
Dedup executor unit tests on H100/B200 (#14556)
Add disagg cancellation stress-test harness skeleton (#14375)
Add UCX TLS env in disagg-related tests (#14626)
Replace ONNX spec with onnx>=1.21.0 in requirements.txt (#14577)
Add test lists with multi-GPU tests to CI multi-GPU test trigger files (#14087)
Add offline equivalence test for sharding IR (#13963)
Enable kv_cache_manager_v2 test for A10 (#12885)
Remove two-model EAGLE3 spec-decoding tests (#14735)
Add TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS in spec decoding perf test (#14438)

What's Changed

[https://nvbugs/6182617][fix] Restore K2.5 multimodal dep8 accuracy test on transformers 5.5.x by @tianyuxbear in https://github.com/NVIDIA/TensorRT-LLM/pull/14392
[None][feat] FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron … by @farazkh80 in https://github.com/NVIDIA/TensorRT-LLM/pull/13773
[None][perf] Integrate the flashinfer gdn prefill kernel for qwen3.5 by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/13644
[None][chore] Update flashinfer-python from 0.6.11.post1 to 0.6.12rc1 by @yihwang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14512
[https://nvbugs/6162328][fix] Add a tiny compat shim in load_hf_tokenizer that, when bytes_to_unicode is m by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14090
[https://nvbugs/6114610][test] unwaive disagg tests fixed by UCX_TLS setter by @xwang233 in https://github.com/NVIDIA/TensorRT-LLM/pull/14440
[None][fix] Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/14452
[https://nvbugs/6184914][test] Unwaive related tests by @yuxianq in https://github.com/NVIDIA/TensorRT-LLM/pull/14523
[https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14404
[None][infra] Waive 2 failed cases for main in post-merge 2734 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14526
[None][infra] Waive 1 failed cases for main in post-merge 2735 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14542
[#11257][feat] Add LoRA support to llmapi triton backend by @karljang in https://github.com/NVIDIA/TensorRT-LLM/pull/14079
[None][chore] Include layer_idx in MoE backend fallback warnings by @dc3671 in https://github.com/NVIDIA/TensorRT-LLM/pull/13409
[None][chore] Add disagg local one-step run script for CI submit by @fredricz-20070104 in...

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Major update to popular LLM inference library