What does this release signal mean?

NVIDIA published v1.3.0rc14 of TensorRT-LLM, its library for optimizing and deploying large language models, in the GitHub repository NVIDIA/TensorRT-LLM. The tag v1.3.0rc14 was published at 2026-05-07T05:55:19Z and is marked as a prerelease. A release signal is evidence of what shipped, changed, or was packaged for users. The release notes open with Highlights, whose first Model Support entry begins "Add prefix caching....". onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/TensorRT-LLM v1.3.0rc14

Captured source

GitHub: github.com/NVIDIA/TensorRT-LLM

Published May 7, 2026; seen Jun 6; captured Jun 11; HTTP 200; method plain

v1.3.0rc14

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc14

Published: 2026-05-07T05:55:19Z

Prerelease: yes
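
The tag and timestamp above follow recognizable shapes (a `vMAJOR.MINOR.PATCHrcN` tag and an ISO 8601 UTC timestamp), so a consumer of this signal could decode them mechanically. The following is a minimal sketch using only the Python standard library; the helper names `parse_tag` and `parse_published` are hypothetical and are not part of TensorRT-LLM or onlylabs, and the tag pattern is an assumption inferred from this one tag.

```python
import re
from datetime import datetime, timezone

# Assumed tag shape, inferred from "v1.3.0rc14": vMAJOR.MINOR.PATCH with an optional rcN suffix.
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?$")


def parse_tag(tag):
    """Split a release tag into numeric parts; rc is None for a final release."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    major, minor, patch, rc = match.groups()
    return {
        "major": int(major),
        "minor": int(minor),
        "patch": int(patch),
        "rc": int(rc) if rc is not None else None,
    }


def parse_published(timestamp):
    """Parse a 'Z'-suffixed ISO 8601 timestamp into a timezone-aware datetime."""
    # Older Python versions reject a trailing 'Z', so spell out the UTC offset.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


tag_info = parse_tag("v1.3.0rc14")
published = parse_published("2026-05-07T05:55:19Z")
is_prerelease = tag_info["rc"] is not None
```

Here `is_prerelease` is derived from the presence of an `rc` suffix, which agrees with the "Prerelease: yes" field for this tag; the explicit field remains the authoritative value.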

Release notes:

Highlights

Model Support
Add prefix caching for Mamba hybrid models including Qwen3.5 and Nemotron Super V3 (#12185)
Improve Qwen3.5 support with custom MoE routing and dense and NVFP4 weight loading fixes (#13433, #13090, #13716)
Improve Nemotron and Nemotron Nano support with GEMM tuning and multimodal placeholder expansion (#13160, #13069)
Add Wan 2.2 5B TI2V support and refine LTX-2 FP4 stage handling (#13256, #13244)

API
Embed VisualGenParams in DiffusionRequest and simplify generate() inputs (#13313)
Add llm.encode() fast path support for encoder-only models (#12801)
Add per-iteration request-aggregate counters to InflightBatchingStats (#13199)
Add ASGI middleware support for Serve (#13378)
Introduce cancellation support in transceiver v2 (#12734)
Fix Triton backend generation parameter handling for promptIgnoreLength, lengthPenalty, earlyStopping, and early_stopping (#13633, #13692)

Feature
Improve VisualGen serving with fast PNG compression, multi-node diffusion workers, non-contiguous multimodal chunked prefill, and Attention2D sequence parallelism (#13074, #13140, #12944, #12943)
Improve disaggregated serving and routing with gen-first ADP serving, KV-aware hit-rate gates and fair-share caps, and consolidated aiohttp session handling (#13112, #13198, #13408)
Expand kernel and runtime performance with GEMM-to-allreduce registered buffers, CuteDSL bf16 dense GEMMs, sparse-attention GVR Top-K dispatchers, fused add-norm-FP8 quantization, TF32 DSA GEMMs, sampler optimizations, and leaner MPI collectives (#11589, #12074, #13477, #12674, #13452, #13480, #13380, #13089)
Improve speculative decoding with DFlash one-model support, Mamba-2 rollback replay, radix-based SWA cleanup, and trtllm-gen routing refactoring (#12794, #13453, #13346, #13328)
Support NVFP4 weight updates (#12320)
Add per-rank torch profile traces for distributed profiling (#13536)

Fix
Fix KV cache and scheduler correctness issues, including WindowBlockManager statistics, Mamba cache handling under MTP with CUDA graph padding, free-block counter corruption, V2 extra_tokens accounting, PEFT page accumulation, and temporary attention-window cleanup (#12448, #13151, #12834, #13619, #13709, #13528, #12450)
Fix disaggregated serving and worker reliability by resolving aggregate PP4 hangs, preventing zombie worker pods, and correcting cached-token usage accounting (#12888, #12718, #13620)
Fix OpenAI and Triton generation flows for None tokenizers, prompt ignore lengths, early stopping, and terminateRequest handling from background logits threads (#13184, #13633, #13692, #13059)
Fix attention and VisualGen runtime issues, including UlyssesAttention sequence lengths, Ulysses plus Sage execution, TRTLLM-Gen GmemReduction illegal memory access, and low-memory Qwen3 skip-softmax behavior (#13486, #13440, #13541, #13581)
Fix distributed runtime stability with corrected pipeline-parallel layer distribution, reduced host-memory regression in speculative decoding, and MoE communication fallback after init exceptions (#13066, #13130, #13331)
Fix cache memory estimation for Qwen3 hybrid models in trtllm-bench and lower Eagle3 one-model acceptance thresholds for H20 (#13268, #13565)

Documentation
Add batch-size tuning guidance for CUDA graph padding and a GVR Top-K technical blog (#13393, #13714)
Remove outdated news items and clean up llmc licensing documentation (#13603, #13700)

Test & Infra
Add and refresh coverage for disaggregated post-merge performance, GPT-OSS 20B MHA, prefix-aware scheduling, cascade-prune repros, and issue-specific regressions (#13343, #12796, #13578, #13572, #13553)
Improve CI triage and failure analysis with Perf Triage Bot integration, rendered HTML failure reports, K8s infrastructure retry, PR base freshness checks, static test validation, and clearer Slurm pending logs (#12429, #13526, #13530, #13430, #13423, #13586)
Improve CI and build stability with lower test memory pressure, adjusted DeepEP token limits, CUDA line info defaults, Debug CUDA flag fixes, module-level skips, and longer FMHA timeouts (#13402, #13484, #13334, #13598, #13223, #12860)
Refresh test organization and dependencies with post-merge test moves, updated constraints, FlashInfer Python updates, B200 multimodal unit-test deduplication, and sorted waive enforcement (#13376, #13482, #13064, #13631, #13584, #12672)
Improve distributed and QA infrastructure with free-port FLUX/WAN test initialization, multinode fallback handling, NIXL-based perf sanity tests, QA popen workarounds, and KVCacheManager connector helper fixes (#13364, #13537, #13654, #13634, #13749)
Improve package and release infrastructure with llmc standalone package cleanup, release-scanning PLC nightly adjustments, devel-stage apt cache mounts, and pip cache reuse (#13466, #13694, #13245, #13510)
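
Each highlight line above ends with a parenthesized list of pull request numbers, which makes it possible to map a summary back to the PRs behind it. A minimal sketch, assuming that trailing `(#N, #N, ...)` convention holds for every line; `split_highlight` is a hypothetical helper name, not something provided by the repository.

```python
import re

# Matches a trailing "(#123, #456)" group at the end of a highlight line.
TRAILING_REFS = re.compile(r"\s*\((#\d+(?:,\s*#\d+)*)\)\s*$")


def split_highlight(line):
    """Return (summary, [pr_numbers]) for one highlight line."""
    match = TRAILING_REFS.search(line)
    if match is None:
        # No PR references found: keep the text and report an empty list.
        return line.strip(), []
    summary = line[:match.start()].strip()
    pr_numbers = [int(n) for n in re.findall(r"#(\d+)", match.group(1))]
    return summary, pr_numbers


summary, prs = split_highlight(
    "Add Wan 2.2 5B TI2V support and refine LTX-2 FP4 stage handling (#13256, #13244)"
)
```

Anchoring the pattern at the end of the line keeps parentheses inside the summary, such as `generate()` or `llm.encode()`, from being mistaken for the reference list.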

What's Changed

[https://nvbugs/6093714][fix] Reduce batch size and add memory guard for test by @govind-ramnarayan in https://github.com/NVIDIA/TensorRT-LLM/pull/13402
[TRTLLM-11373][refactor] Embed VisualGenParams in DiffusionRequest and simplify generate() inputs by @zhenhuaw-me in https://github.com/NVIDIA/TensorRT-LLM/pull/13313
[None][test] Update CI Post-Merge Disagg Perf Tests by @chenfeiz0326 in https://github.com/NVIDIA/TensorRT-LLM/pull/13343
[None][chore] AutoDeploy: Refactor finegrained FP8 scale sharding helpers by @galagam in https://github.com/NVIDIA/TensorRT-LLM/pull/12999
[https://nvbugs/6076564][fix] unwaive TestNemotronH::test_auto_dtype[trtllm-flashinfer_ssm-False] by @tcherckez-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/13187
[TRTLLM-10061][feat] Prefix caching support for mamba hybrid models by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/12185
[None][cleanup] remove legacy addSequence path by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/13280
[None][infra] Waive 1 failed cases for main in pre-merge 35790 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/13483
[None][fix] Fix bugs in WindowBlockManager destructor statistics by @eopXD in https://github.com/NVIDIA/TensorRT-LLM/pull/12448
[None][chore] Update CI allowlist 2026-04-23 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/13381
[None][fix] Consolidate aiohttp session management in disagg router...
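
The "What's Changed" entries above share a visible layout: `[ticket][type] title by @author in <pull request URL>`. A minimal sketch of a parser for that layout follows; the field names and the `parse_change` helper are hypothetical, and the pattern is inferred only from the entries shown in this excerpt. Truncated entries, such as the last one above, deliberately yield `None` rather than a guessed result.

```python
import re

# Layout inferred from the excerpt: [ticket][type] title by @author in <PR URL>
CHANGE_PATTERN = re.compile(
    r"^\[(?P<ticket>[^\]]+)\]\[(?P<kind>[^\]]+)\]\s+"
    r"(?P<title>.+)\s+by\s+@(?P<author>[\w-]+)\s+in\s+"
    r"(?P<url>https://github\.com/\S+/pull/(?P<pr>\d+))$"
)


def parse_change(line):
    """Parse one changelog entry into a dict, or return None if it does not match."""
    match = CHANGE_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["pr"] = int(entry["pr"])
    # The literal ticket "None" means no tracking ticket was attached.
    if entry["ticket"] == "None":
        entry["ticket"] = None
    return entry


entry = parse_change(
    "[TRTLLM-10061][feat] Prefix caching support for mamba hybrid models "
    "by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/12185"
)
```

Grouping parsed entries by `kind` (for example `fix`, `feat`, `infra`) would give a rough count of change types, though only for the portion of the list captured in this excerpt.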

Excerpt shown; open the source for the full document.

Notability

notability 7.0/10

Notable release candidate for key LLM inference library