Release | NVIDIA | published May 26, 2026

NVIDIA/TensorRT-LLM v1.3.0rc16

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc16

Published: 2026-05-26T08:08:12Z

Prerelease: yes

Release notes:

Highlights

  • Model Support
      • Add Gemma4 multimodal support with native vision and audio towers (#14300)
      • Add Qwen3.5 MTP and Qwen3.6-27B-FP8 model support (#12646, #14359)
      • Add EXAONE-4.5 and Laguna model support (#12873, #13559)
      • Switch DeepSeek, NemotronH, Qwen3, and Qwen3.5-MoE to sharding-IR canonical models (#13478)
  • API
      • Refactor the VisualGenArgs API and registry (#14175)
      • Drop sink_token_length from the PyTorch attention surface (#14275)
      • Add OpenAI chat logit bias validation (#13518)
      • Reject incompatible KV connector configurations at construction time (#13577)
  • Feature
      • Add exact multimodal KV block hashing and KV cache reuse probing (#13815, #14333)
      • Add KV cache manager v2 with Python transceiver updates (#12928)
      • Add disaggregated serving support with block reuse enabled for hybrid models (#14060)
      • Add FlashInfer MLA attention backend support and SkipSoftmax sparse attention support for visual generation (#13428, #12947)
      • Add Ring Attention and unified context parallelism for VisualGen (#13821)
      • Add legacy and TensorRT-LLM 1.x modelopt quantization config support (#14088)
      • Add debugging environment variables for mamba modules (#14170)
      • Add single-rank MPI sleep/wakeup support and a rank-0 collective_rpc shim (#14052)
      • Add opentelemetry metrics for disaggregated serving with multiple postprocessing workers (#12637)
      • Support SWA scratch reuse rewind (#14412)
      • Improve FMHA, FlashInfer TRTLLM-Gen, and KV cache buffer calculation paths (#14291, #12525)
      • Improve fused-kernel and attention performance with shared-expert combine fusion, paged MQA logits decode tuning, LTX2 fused RMSNorm/RoPE, EAGLE3 dynamic tree kernel optimizations, and cu_seqlens conversion updates (#14306, #14133, #13985, #13426, #13566)
      • Optimize beam search candidate reconstruction by skipping prompt-prefix copies (#14197)
      • Update cubins to resolve the FMHA PDL issue (#14462)
      • Use CUDA 13 CUTLASS DSL package (#14354)
  • Fix
      • Fix disaggregated benchmark, usage propagation, and worker registration stability issues (#13347, #14177, #14289)
      • Fix DeepSeek-V3 OOM handling and artifacts paths (#14232)
      • Fix missing get_draft_token_length import in py_executor (#14366)
      • Fix Lora load failure handling (#13517)
      • Fix Kimi K2.5 speculative decoding behavior (#14379)
      • Fix Qwen3HybridConfig layer_types derivation and route load_hf_model_config through AutoConfig (#13832, #14410)
      • Fix CppMambaHybridCacheManager functional and performance issues (#14003)
      • Fix MTP disaggregated speculative_config coverage (#14391)
      • Fix KVCacheTransfer divide-by-zero and KV cache grain slot refinement issues (#13618, #14442)
      • Fix memory usage during refit and EPLB config model loading (#14331, #11962)
      • Fix MPI worker allocator configuration and GB300 cluster environment setup (#14152, #14460)
      • Fix profiler runner exception handling with synchronized CUDA cleanup (#13469)
      • Disable mamba replay by default (#14471)
  • Documentation
      • Add a Claude skill for multimodal model onboarding (#13842)
      • Update Gemma 4 entries in supported-models.md (#14463)
      • Fix invalid documentation and deployment guide links (#14337, #14522)
  • Benchmark
      • Add LPIPS scoring for visual generation model regression tests (#13567)
      • Add a bench_moe microbenchmark (#14507)
      • Update visual generation and accuracy thresholds for Wan 2.2, Qwen3.5-4B DFlash, and Nano V3 (#14372, #14411, #14078)
      • Disable ignore-eos when using speculative decoding in performance tests (#14347)
  • Test & Infra
      • Split verl tests into fine-grained per-case wrappers (#14037)
      • Add new stress cases (#14390)
      • Clean outdated test duration entries and remove deprecated disaggregated sampler and spark test cases (#14340, #14335, #14380)
      • Isolate ray tests to avoid GCS timeout in a single pytest session (#14342)
      • Improve L0 retry timeout budgeting and cap infra retry attempts (#14323, #14415)
      • Handle sacct errors when checking Slurm job status (#14367)
      • Fix B300 MegaMoE and MoE test selection (#14362, #14401)
      • Fix container scanning according to the latest security team guidance (#14430)
      • Deduplicate miscellaneous unit tests on B200 (#14525)
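
The API highlights mention OpenAI chat logit bias validation (#13518) without describing the checks. As a rough illustration only, the sketch below shows what such validation typically covers under the public OpenAI API contract, where `logit_bias` maps token IDs to bias values between -100 and 100. The function name `validate_logit_bias` and the `vocab_size` parameter are hypothetical; this is not the code from the pull request.

```python
from typing import Dict, Mapping, Union

Number = Union[int, float]


def validate_logit_bias(logit_bias: Mapping[Union[str, int], Number],
                        vocab_size: int) -> Dict[int, float]:
    """Return a normalized {token_id: bias} dict or raise on bad input.

    Hypothetical helper: the name, signature, and vocab_size check are
    assumptions made for illustration, not TensorRT-LLM's actual API.
    """
    if not isinstance(logit_bias, Mapping):
        raise TypeError("logit_bias must be a mapping of token id to bias")

    normalized: Dict[int, float] = {}
    for key, bias in logit_bias.items():
        # JSON object keys arrive as strings; plain ints are accepted too.
        if isinstance(key, bool) or not isinstance(key, (str, int)):
            raise ValueError(f"token id {key!r} must be a string or int")
        try:
            token_id = int(key)
        except ValueError:
            raise ValueError(f"token id {key!r} is not an integer") from None
        if not 0 <= token_id < vocab_size:
            raise ValueError(f"token id {token_id} is outside the vocabulary")

        # bool is a subclass of int, so reject it explicitly.
        if isinstance(bias, bool) or not isinstance(bias, (int, float)):
            raise ValueError(f"bias for token {token_id} must be a number")
        if not -100 <= bias <= 100:
            raise ValueError(f"bias for token {token_id} must be in [-100, 100]")

        normalized[token_id] = float(bias)
    return normalized
```

Rejecting a bad request up front, rather than letting it reach the sampler, mirrors the "fail at construction time" approach the same section applies to KV connector configurations.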

What's Changed

  • [None][chore] Update Claude Code agents and skills by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/14344
  • [None][perf] Fuse sigmoid+mul+add shared-expert combine into one Trit… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/14306
  • [None][infra] Waive 1 failed cases for main in pre-merge 38925 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14346
  • [None][infra] Revert Mingyang back to mingyangHao in allowlist by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14349
  • [None][cleanup] MistralSmall related cleanups by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/14271
  • [None][chore] Clean test_durations file by removing outdated items. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/14340
  • [None][infra] Waive 2 failed cases for main in post-merge 2725 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14357
  • [None][feat] Exact multimodal KV blockhashing by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/13815
  • [None][infra] Waive 1 failed cases for main in pre-merge 38987 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14350
  • [None][feat] Update the logic of FMHA JIT path by @heyuhhh in https://github.com/NVIDIA/TensorRT-LLM/pull/14291
  • [None][feat] opentelemetry metrics for num_postproc_workers > 0 disagg by @karen-sy in https://github.com/NVIDIA/TensorRT-LLM/pull/12637
  • [TRTLLM-12385][feat] Use LPIPS score for visual gen model regression test by @yibinl-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/13567
  • [None][chore] Remove closed bugs by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14217
  • [https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf YAMLs by @xwang233 in https://github.com/NVIDIA/TensorRT-LLM/pull/14191
  • [None][feat] add single-rank MPI sleep/wakeup and rank-0 collective_rpc shim by @hhzhang16 in https://github.com/NVIDIA/TensorRT-LLM/pull/14052
  • [https://nvbugs/6093911][fix] Fix disagg gen-only benchmark hang under ADP router imbalance by @chienchunhung in https://github.com/NVIDIA/TensorRT-LLM/pull/13347
  • [None][fix] Import missing get_draft_token_length in py_executor by @nv-guomingz…

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Major release candidate for LLM inference optimization by NVIDIA.