What does this release signal mean?

NVIDIA published TensorRT-LLM v1.3.0rc16, a prerelease of its inference library for large language models, on 2026-05-26T08:08:12Z (repository NVIDIA/TensorRT-LLM, tag v1.3.0rc16). A release signal is evidence of what shipped, changed, or was packaged for users. The release notes open with model support highlights, starting with Gemma4 multimodal support. onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/TensorRT-LLM v1.3.0rc16

Captured source

GitHub: github.com/NVIDIA/TensorRT-LLM

NVIDIA/TensorRT-LLM v1.3.0rc16

Published May 26, 2026 · seen Jun 6 · captured Jun 11 · HTTP 200 · method: plain

v1.3.0rc16

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc16

Published: 2026-05-26T08:08:12Z

Prerelease: yes
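
The tag and timestamp above can be read programmatically. The following is a minimal sketch, assuming tags follow the `vMAJOR.MINOR.PATCHrcN` shape seen in this one release; the pattern, the `ReleaseTag` type, and the helper names are illustrative and are not part of TensorRT-LLM or the GitHub API.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed tag shape, inferred only from "v1.3.0rc16"; other tags may differ.
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?$")


class ReleaseTag(NamedTuple):
    major: int
    minor: int
    patch: int
    rc: Optional[int]  # None for a final release

    @property
    def is_prerelease(self) -> bool:
        return self.rc is not None


def parse_tag(tag: str) -> ReleaseTag:
    """Split a tag such as 'v1.3.0rc16' into numeric components."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    major, minor, patch, rc = match.groups()
    return ReleaseTag(int(major), int(minor), int(patch),
                      int(rc) if rc is not None else None)


def parse_published(timestamp: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp with a trailing 'Z'."""
    # Replacing 'Z' keeps this working on Python versions before 3.11.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


tag = parse_tag("v1.3.0rc16")
published = parse_published("2026-05-26T08:08:12Z")
print(tag, tag.is_prerelease, published.isoformat())
```

Comparing `ReleaseTag` tuples directly is not a full version ordering, since a final release has `rc` set to `None`; a dedicated version library is the safer choice for sorting.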

Release notes:

Highlights

Model Support
Add Gemma4 multimodal support with native vision and audio towers (#14300)
Add Qwen3.5 MTP and Qwen3.6-27B-FP8 model support (#12646, #14359)
Add EXAONE-4.5 and Laguna model support (#12873, #13559)
Switch DeepSeek, NemotronH, Qwen3, and Qwen3.5-MoE to sharding-IR canonical models (#13478)

API
Refactor the VisualGenArgs API and registry (#14175)
Drop sink_token_length from the PyTorch attention surface (#14275)
Add OpenAI chat logit bias validation (#13518)
Reject incompatible KV connector configurations at construction time (#13577)
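
To make the logit bias item concrete, here is a minimal sketch of what validating an OpenAI-style `logit_bias` field can look like. It relies on the OpenAI chat API convention that `logit_bias` maps token IDs (as string keys) to bias values from -100 to 100. The function name, the `vocab_size` parameter, and the error messages are hypothetical; this is not the code added in #13518.

```python
from typing import Dict, Mapping


def validate_logit_bias(logit_bias: Mapping[str, float],
                        vocab_size: int) -> Dict[int, float]:
    """Return {token_id: bias}, or raise ValueError on the first bad entry."""
    validated: Dict[int, float] = {}
    for key, bias in logit_bias.items():
        # Keys arrive as JSON strings and must parse as integer token IDs.
        try:
            token_id = int(key)
        except (TypeError, ValueError):
            raise ValueError(f"logit_bias key {key!r} is not an integer token ID")
        if not 0 <= token_id < vocab_size:
            raise ValueError(f"token ID {token_id} is outside [0, {vocab_size})")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(bias, bool) or not isinstance(bias, (int, float)):
            raise ValueError(f"bias for token {token_id} must be a number")
        # The chained comparison is False for NaN, so NaN is rejected too.
        if not -100 <= bias <= 100:
            raise ValueError(f"bias {bias} for token {token_id} is outside [-100, 100]")
        validated[token_id] = float(bias)
    return validated


print(validate_logit_bias({"50256": -100, "1234": 2.5}, vocab_size=100_000))
```

Rejecting a bad request up front means the caller sees a clear client error instead of a failure later in sampling.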

Feature
Add exact multimodal KV block hashing and KV cache reuse probing (#13815, #14333)
Add KV cache manager v2 with Python transceiver updates (#12928)
Add disaggregated serving support with block reuse enabled for hybrid models (#14060)
Add FlashInfer MLA attention backend support and SkipSoftmax sparse attention support for visual generation (#13428, #12947)
Add Ring Attention and unified context parallelism for VisualGen (#13821)
Add legacy and TensorRT-LLM 1.x modelopt quantization config support (#14088)
Add debugging environment variables for mamba modules (#14170)
Add single-rank MPI sleep/wakeup support and a rank-0 collective_rpc shim (#14052)
Add opentelemetry metrics for disaggregated serving with multiple postprocessing workers (#12637)
Support SWA scratch reuse rewind (#14412)
Improve FMHA, FlashInfer TRTLLM-Gen, and KV cache buffer calculation paths (#14291, #12525)
Improve fused-kernel and attention performance with shared-expert combine fusion, paged MQA logits decode tuning, LTX2 fused RMSNorm/RoPE, EAGLE3 dynamic tree kernel optimizations, and cu_seqlens conversion updates (#14306, #14133, #13985, #13426, #13566)
Optimize beam search candidate reconstruction by skipping prompt-prefix copies (#14197)
Update cubins to resolve the FMHA PDL issue (#14462)
Use CUDA 13 CUTLASS DSL package (#14354)
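
The KV block hashing and reuse probing items can be illustrated with a generic technique: chained hashing of fixed-size token blocks, with the digest of any multimodal content mixed into the blocks it covers. The sketch below shows that general idea only. The function names, the per-position `mm_digests` layout, and the choice of SHA-256 are assumptions made for illustration and do not describe the implementation in #13815 or #14333.

```python
import hashlib
from typing import List, Optional, Sequence


def block_hashes(token_ids: Sequence[int],
                 block_size: int,
                 mm_digests: Optional[Sequence[Optional[bytes]]] = None) -> List[str]:
    """Hash each full block, chaining in the previous block's hash.

    mm_digests (hypothetical layout) holds, per token position, the digest of
    the image or audio item behind that placeholder token, or None for text.
    """
    hashes: List[str] = []
    parent = b""
    for block in range(len(token_ids) // block_size):
        hasher = hashlib.sha256(parent)  # chaining makes a hash identify its whole prefix
        start = block * block_size
        for pos in range(start, start + block_size):
            hasher.update(token_ids[pos].to_bytes(8, "little"))
            if mm_digests is not None and mm_digests[pos] is not None:
                hasher.update(mm_digests[pos])
        parent = hasher.digest()
        hashes.append(parent.hex())
    return hashes


def reusable_prefix_blocks(cached: Sequence[str], incoming: Sequence[str]) -> int:
    """Probe how many leading blocks of a new request match cached blocks."""
    count = 0
    for old, new in zip(cached, incoming):
        if old != new:
            break
        count += 1
    return count


# Same token IDs, different images behind the placeholder tokens (ID 9).
tokens = [1, 2, 3, 4, 9, 9, 5, 6]
cat = hashlib.sha256(b"cat image bytes").digest()
dog = hashlib.sha256(b"dog image bytes").digest()
with_cat = block_hashes(tokens, 4, [None] * 4 + [cat, cat, None, None])
with_dog = block_hashes(tokens, 4, [None] * 4 + [dog, dog, None, None])
print(reusable_prefix_blocks(with_cat, with_dog))
```

Hashing token IDs alone would treat the two requests as identical, because both use the same placeholder IDs; mixing in the content digest limits reuse to the text-only block that precedes the image.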

Fix
Fix disaggregated benchmark, usage propagation, and worker registration stability issues (#13347, #14177, #14289)
Fix DeepSeek-V3 OOM handling and artifacts paths (#14232)
Fix missing get_draft_token_length import in py_executor (#14366)
Fix LoRA load failure handling (#13517)
Fix Kimi K2.5 speculative decoding behavior (#14379)
Fix Qwen3HybridConfig layer_types derivation and route load_hf_model_config through AutoConfig (#13832, #14410)
Fix CppMambaHybridCacheManager functional and performance issues (#14003)
Fix MTP disaggregated speculative_config coverage (#14391)
Fix KVCacheTransfer divide-by-zero and KV cache grain slot refinement issues (#13618, #14442)
Fix memory usage during refit and EPLB config model loading (#14331, #11962)
Fix MPI worker allocator configuration and GB300 cluster environment setup (#14152, #14460)
Fix profiler runner exception handling with synchronized CUDA cleanup (#13469)
Disable mamba replay by default (#14471)

Documentation
Add a Claude skill for multimodal model onboarding (#13842)
Update Gemma 4 entries in supported-models.md (#14463)
Fix invalid documentation and deployment guide links (#14337, #14522)

Benchmark
Add LPIPS scoring for visual generation model regression tests (#13567)
Add a bench_moe microbenchmark (#14507)
Update visual generation and accuracy thresholds for Wan 2.2, Qwen3.5-4B DFlash, and Nano V3 (#14372, #14411, #14078)
Disable ignore-eos when using speculative decoding in performance tests (#14347)

Test & Infra
Split verl tests into fine-grained per-case wrappers (#14037)
Add new stress cases (#14390)
Clean outdated test duration entries and remove deprecated disaggregated sampler and spark test cases (#14340, #14335, #14380)
Isolate ray tests to avoid GCS timeout in a single pytest session (#14342)
Improve L0 retry timeout budgeting and cap infra retry attempts (#14323, #14415)
Handle sacct errors when checking Slurm job status (#14367)
Fix B300 MegaMoE and MoE test selection (#14362, #14401)
Fix container scanning according to the latest security team guidance (#14430)
Deduplicate miscellaneous unit tests on B200 (#14525)

What's Changed

[None][chore] Update Claude Code agents and skills by @kaiyux in https://github.com/NVIDIA/TensorRT-LLM/pull/14344
[None][perf] Fuse sigmoid+mul+add shared-expert combine into one Trit… by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/14306
[None][infra] Waive 1 failed cases for main in pre-merge 38925 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14346
[None][infra] Revert Mingyang back to mingyangHao in allowlist by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14349
[None][cleanup] MistralSmall related cleanups by @2ez4bz in https://github.com/NVIDIA/TensorRT-LLM/pull/14271
[None][chore] Clean test_durations file by removing outdated items. by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/14340
[None][infra] Waive 2 failed cases for main in post-merge 2725 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14357
[None][feat] Exact multimodal KV blockhashing by @venkywonka in https://github.com/NVIDIA/TensorRT-LLM/pull/13815
[None][infra] Waive 1 failed cases for main in pre-merge 38987 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14350
[None][feat] Update the logic of FMHA JIT path by @heyuhhh in https://github.com/NVIDIA/TensorRT-LLM/pull/14291
[None][feat] opentelemetry metrics for num_postproc_workers > 0 disagg by @karen-sy in https://github.com/NVIDIA/TensorRT-LLM/pull/12637
[TRTLLM-12385][feat] Use LPIPS score for visual gen model regression test by @yibinl-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/13567
[None][chore] Remove closed bugs by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14217
[https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf YAMLs by @xwang233 in https://github.com/NVIDIA/TensorRT-LLM/pull/14191
[None][feat] add single-rank MPI sleep/wakeup and rank-0 collective_rpc shim by @hhzhang16 in https://github.com/NVIDIA/TensorRT-LLM/pull/14052
[https://nvbugs/6093911][fix] Fix disagg gen-only benchmark hang under ADP router imbalance by @chienchunhung in https://github.com/NVIDIA/TensorRT-LLM/pull/13347
[None][fix] Import missing get_draft_token_length in py_executor by @nv-guomingz...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Release candidate (rc16) of NVIDIA's LLM inference library, with new model support plus API, KV cache, and disaggregated serving changes.