NVIDIA/TensorRT-LLM v1.3.0rc16
Repository: NVIDIA/TensorRT-LLM
Tag: v1.3.0rc16
Published: 2026-05-26T08:08:12Z
Prerelease: yes
Release notes:
Highlights
- Model Support
- Add Gemma4 multimodal support with native vision and audio towers (#14300)
- Add Qwen3.5 MTP and Qwen3.6-27B-FP8 model support (#12646, #14359)
- Add EXAONE-4.5 and Laguna model support (#12873, #13559)
- Switch DeepSeek, NemotronH, Qwen3, and Qwen3.5-MoE to sharding-IR canonical models (#13478)
- API
- Refactor the VisualGenArgs API and registry (#14175)
- Drop sink_token_length from the PyTorch attention surface (#14275)
- Add OpenAI chat logit bias validation (#13518)
- Reject incompatible KV connector configurations at construction time (#13577)
- Feature
- Add exact multimodal KV block hashing and KV cache reuse probing (#13815, #14333)
- Add KV cache manager v2 with Python transceiver updates (#12928)
- Add disaggregated serving support with block reuse enabled for hybrid models (#14060)
- Add FlashInfer MLA attention backend support and SkipSoftmax sparse attention support for visual generation (#13428, #12947)
- Add Ring Attention and unified context parallelism for VisualGen (#13821)
- Add legacy and TensorRT-LLM 1.x modelopt quantization config support (#14088)
- Add debugging environment variables for Mamba modules (#14170)
- Add single-rank MPI sleep/wakeup support and a rank-0 collective_rpc shim (#14052)
- Add OpenTelemetry metrics for disaggregated serving with multiple postprocessing workers (#12637)
- Support SWA scratch reuse rewind (#14412)
- Improve FMHA, FlashInfer TRTLLM-Gen, and KV cache buffer calculation paths (#14291, #12525)
- Improve fused-kernel and attention performance with shared-expert combine fusion, paged MQA logits decode tuning, LTX2 fused RMSNorm/RoPE, EAGLE3 dynamic tree kernel optimizations, and cu_seqlens conversion updates (#14306, #14133, #13985, #13426, #13566)
- Optimize beam search candidate reconstruction by skipping prompt-prefix copies (#14197)
- Update cubins to resolve the FMHA PDL issue (#14462)
- Use CUDA 13 CUTLASS DSL package (#14354)
- Fix
- Fix disaggregated benchmark, usage propagation, and worker registration stability issues (#13347, #14177, #14289)
- Fix DeepSeek-V3 OOM handling and artifacts paths (#14232)
- Fix missing get_draft_token_length import in py_executor (#14366)
- Fix LoRA load failure handling (#13517)
- Fix Kimi K2.5 speculative decoding behavior (#14379)
- Fix Qwen3HybridConfig layer_types derivation and route load_hf_model_config through AutoConfig (#13832, #14410)
- Fix CppMambaHybridCacheManager functional and performance issues (#14003)
- Fix MTP disaggregated speculative_config coverage (#14391)
- Fix KVCacheTransfer divide-by-zero and KV cache grain slot refinement issues (#13618, #14442)
- Fix memory usage during refit and EPLB config model loading (#14331, #11962)
- Fix MPI worker allocator configuration and GB300 cluster environment setup (#14152, #14460)
- Fix profiler runner exception handling with synchronized CUDA cleanup (#13469)
- Disable Mamba replay by default (#14471)
- Documentation
- Add a Claude skill for multimodal model onboarding (#13842)
- Update Gemma 4 entries in supported-models.md (#14463)
- Fix invalid documentation and deployment guide links (#14337, #14522)
- Benchmark
- Add LPIPS scoring for visual generation model regression tests (#13567)
- Add a bench_moe microbenchmark (#14507)
- Update visual generation and accuracy thresholds for Wan 2.2, Qwen3.5-4B DFlash, and Nano V3 (#14372, #14411, #14078)
- Disable ignore-eos when using speculative decoding in performance tests (#14347)
- Test & Infra
- Split verl tests into fine-grained per-case wrappers (#14037)
- Add new stress cases (#14390)
- Clean outdated test duration entries and remove deprecated disaggregated sampler and spark test cases (#14340, #14335, #14380)
- Isolate Ray tests to avoid GCS timeout in a single pytest session (#14342)
- Improve L0 retry timeout budgeting and cap infra retry attempts (#14323, #14415)
- Handle sacct errors when checking Slurm job status (#14367)
- Fix B300 MegaMoE and MoE test selection (#14362, #14401)
- Fix container scanning according to the latest security team guidance (#14430)
- Deduplicate miscellaneous unit tests on B200 (#14525)
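To illustrate the kind of check implied by the "Add OpenAI chat logit bias validation (#13518)" entry above, here is a minimal sketch. It is not the TensorRT-LLM implementation: the function name `validate_logit_bias` and the `vocab_size` parameter are hypothetical, and the rules it enforces (integer token ID keys, bias values within -100 to 100) are taken from the public OpenAI Chat Completions description of `logit_bias`, not from the PR itself.

```python
from typing import Mapping, Optional


def validate_logit_bias(
    logit_bias: Optional[Mapping[str, float]],
    vocab_size: int,
) -> dict[int, float]:
    """Hypothetical helper: return {token_id: bias} or raise ValueError.

    Assumed rules (OpenAI-style request field, not confirmed by the PR):
      * keys are token IDs, usually sent as JSON strings
      * values are numbers in the closed range [-100, 100]
    """
    if logit_bias is None:
        return {}

    cleaned: dict[int, float] = {}
    for key, bias in logit_bias.items():
        # JSON object keys arrive as strings, so convert them to int here.
        try:
            token_id = int(key)
        except (TypeError, ValueError):
            raise ValueError(
                f"logit_bias key {key!r} is not an integer token ID"
            ) from None

        if not 0 <= token_id < vocab_size:
            raise ValueError(
                f"logit_bias token ID {token_id} is outside [0, {vocab_size})"
            )

        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(bias, bool) or not isinstance(bias, (int, float)):
            raise ValueError(f"logit_bias value for {token_id} is not a number")

        # NaN fails this comparison and is therefore rejected as well.
        if not -100.0 <= bias <= 100.0:
            raise ValueError(
                f"logit_bias value {bias} for {token_id} is outside [-100, 100]"
            )

        cleaned[token_id] = float(bias)

    return cleaned
```

A server following this pattern would call the helper while parsing the request and turn a `ValueError` into an HTTP 400 response, so malformed biases are rejected before any sampling work starts.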
What's Changed
- [None][chore] Update Claude Code agents and skills by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/14344
- [None][perf] Fuse sigmoid+mul+add shared-expert combine into one Trit… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/14306
- [None][infra] Waive 1 failed cases for main in pre-merge 38925 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14346
- [None][infra] Revert Mingyang back to mingyangHao in allowlist by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14349
- [None][cleanup] MistralSmall related cleanups by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/14271
- [None][chore] Clean test_durations file by removing outdated items. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/14340
- [None][infra] Waive 2 failed cases for main in post-merge 2725 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14357
- [None][feat] Exact multimodal KV blockhashing by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/13815
- [None][infra] Waive 1 failed cases for main in pre-merge 38987 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14350
- [None][feat] Update the logic of FMHA JIT path by @heyuhhh in https://github.com/NVIDIA/TensorRT-LLM/pull/14291
- [None][feat] opentelemetry metrics for num_postproc_workers > 0 disagg by @karen-sy in https://github.com/NVIDIA/TensorRT-LLM/pull/12637
- [TRTLLM-12385][feat] Use LPIPS score for visual gen model regression test by @yibinl-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/13567
- [None][chore] Remove closed bugs by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14217
- [https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf YAMLs by @xwang233 in https://github.com/NVIDIA/TensorRT-LLM/pull/14191
- [None][feat] add single-rank MPI sleep/wakeup and rank-0 collective_rpc shim by @hhzhang16 in https://github.com/NVIDIA/TensorRT-LLM/pull/14052
- [https://nvbugs/6093911][fix] Fix disagg gen-only benchmark hang under ADP router imbalance by @chienchunhung in https://github.com/NVIDIA/TensorRT-LLM/pull/13347
- [None][fix] Import missing get_draft_token_length in py_executor by @nv-guomingz…