NVIDIA/TensorRT-LLM v1.3.0rc14
Repository: NVIDIA/TensorRT-LLM
Tag: v1.3.0rc14
Published: 2026-05-07T05:55:19Z
Prerelease: yes
Release notes:
Highlights
- Model Support
- Add prefix caching for Mamba hybrid models including Qwen3.5 and Nemotron Super V3 (#12185)
- Improve Qwen3.5 support with custom MoE routing and dense and NVFP4 weight loading fixes (#13433, #13090, #13716)
- Improve Nemotron and Nemotron Nano support with GEMM tuning and multimodal placeholder expansion (#13160, #13069)
- Add Wan 2.2 5B TI2V support and refine LTX-2 FP4 stage handling (#13256, #13244)
- API
- Embed VisualGenParams in DiffusionRequest and simplify generate() inputs (#13313)
- Add llm.encode() fast path support for encoder-only models (#12801)
- Add per-iteration request-aggregate counters to InflightBatchingStats (#13199)
- Add ASGI middleware support for Serve (#13378)
- Introduce cancellation support in transceiver v2 (#12734)
- Fix Triton backend generation parameter handling for promptIgnoreLength, lengthPenalty, earlyStopping, and early_stopping (#13633, #13692)
- Feature
- Improve VisualGen serving with fast PNG compression, multi-node diffusion workers, non-contiguous multimodal chunked prefill, and Attention2D sequence parallelism (#13074, #13140, #12944, #12943)
- Improve disaggregated serving and routing with gen-first ADP serving, KV-aware hit-rate gates and fair-share caps, and consolidated aiohttp session handling (#13112, #13198, #13408)
- Expand kernel and runtime performance with GEMM-to-allreduce registered buffers, CuteDSL bf16 dense GEMMs, sparse-attention GVR Top-K dispatchers, fused add-norm-FP8 quantization, TF32 DSA GEMMs, sampler optimizations, and leaner MPI collectives (#11589, #12074, #13477, #12674, #13452, #13480, #13380, #13089)
- Improve speculative decoding with DFlash one-model support, Mamba-2 rollback replay, radix-based SWA cleanup, and trtllm-gen routing refactoring (#12794, #13453, #13346, #13328)
- Support NVFP4 weight updates (#12320)
- Add per-rank torch profile traces for distributed profiling (#13536)
- Fix
- Fix KV cache and scheduler correctness issues, including WindowBlockManager statistics, Mamba cache handling under MTP with CUDA graph padding, free-block counter corruption, V2 extra_tokens accounting, PEFT page accumulation, and temporary attention-window cleanup (#12448, #13151, #12834, #13619, #13709, #13528, #12450)
- Fix disaggregated serving and worker reliability by resolving aggregate PP4 hangs, preventing zombie worker pods, and correcting cached-token usage accounting (#12888, #12718, #13620)
- Fix OpenAI and Triton generation flows for None tokenizers, prompt ignore lengths, early stopping, and terminateRequest handling from background logits threads (#13184, #13633, #13692, #13059)
- Fix attention and VisualGen runtime issues, including UlyssesAttention sequence lengths, Ulysses plus Sage execution, TRTLLM-Gen GmemReduction illegal memory access, and low-memory Qwen3 skip-softmax behavior (#13486, #13440, #13541, #13581)
- Fix distributed runtime stability with corrected pipeline-parallel layer distribution, reduced host-memory regression in speculative decoding, and MoE communication fallback after init exceptions (#13066, #13130, #13331)
- Fix cache memory estimation for Qwen3 hybrid models in trtllm-bench and lower Eagle3 one-model acceptance thresholds for H20 (#13268, #13565)
- Documentation
- Add batch-size tuning guidance for CUDA graph padding and a GVR Top-K technical blog (#13393, #13714)
- Remove outdated news items and clean up llmc licensing documentation (#13603, #13700)
- Test & Infra
- Add and refresh coverage for disaggregated post-merge performance, GPT-OSS 20B MHA, prefix-aware scheduling, cascade-prune repros, and issue-specific regressions (#13343, #12796, #13578, #13572, #13553)
- Improve CI triage and failure analysis with Perf Triage Bot integration, rendered HTML failure reports, K8s infrastructure retry, PR base freshness checks, static test validation, and clearer Slurm pending logs (#12429, #13526, #13530, #13430, #13423, #13586)
- Improve CI and build stability with lower test memory pressure, adjusted DeepEP token limits, CUDA line info defaults, Debug CUDA flag fixes, module-level skips, and longer FMHA timeouts (#13402, #13484, #13334, #13598, #13223, #12860)
- Refresh test organization and dependencies with post-merge test moves, updated constraints, FlashInfer Python updates, B200 multimodal unit-test deduplication, and sorted waive enforcement (#13376, #13482, #13064, #13631, #13584, #12672)
- Improve distributed and QA infrastructure with free-port FLUX/WAN test initialization, multinode fallback handling, NIXL-based perf sanity tests, QA popen workarounds, and KVCacheManager connector helper fixes (#13364, #13537, #13654, #13634, #13749)
- Improve package and release infrastructure with llmc standalone package cleanup, release-scanning PLC nightly adjustments, devel-stage apt cache mounts, and pip cache reuse (#13466, #13694, #13245, #13510)
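For readers unfamiliar with the Serve middleware item (#13378), the following is a minimal sketch of a pure ASGI middleware, assuming the item refers to the standard ASGI interface (`scope`, `receive`, `send`). The names `TimingMiddleware`, `demo_app`, and `run_once` are hypothetical, and these notes do not say how a middleware is registered with Serve, so the sketch drives a stand-in application directly instead of calling any TensorRT-LLM API.

```python
import asyncio
import time


class TimingMiddleware:
    """Hypothetical pure ASGI middleware that adds an x-elapsed-ms response header."""

    def __init__(self, app):
        self.app = app  # the wrapped ASGI application

    async def __call__(self, scope, receive, send):
        # Pass non-HTTP traffic (lifespan, websocket) through untouched.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()

        async def send_with_timing(message):
            # Headers can only be changed on the response-start message.
            if message["type"] == "http.response.start":
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                headers = list(message.get("headers", []))
                headers.append((b"x-elapsed-ms", f"{elapsed_ms:.1f}".encode()))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_timing)


async def demo_app(scope, receive, send):
    """Hypothetical stand-in for the real server application."""
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"ok"})


def run_once(app):
    """Drive a single HTTP request through an ASGI app and collect sent messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/health", "headers": []}
    asyncio.run(app(scope, receive, send))
    return sent


messages = run_once(TimingMiddleware(demo_app))
```

Because the middleware only wraps the `send` callable, it stays independent of the application behind it; the same class could in principle wrap any ASGI application, though the actual registration mechanism in Serve should be checked against #13378.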
What's Changed
- [https://nvbugs/6093714][fix] Reduce batch size and add memory guard for test by @govind-ramnarayan in https://github.com/NVIDIA/TensorRT-LLM/pull/13402
- [TRTLLM-11373][refactor] Embed VisualGenParams in DiffusionRequest and simplify generate() inputs by @zhenhuaw-me in https://github.com/NVIDIA/TensorRT-LLM/pull/13313
- [None][test] Update CI Post-Merge Disagg Perf Tests by @chenfeiz0326 in https://github.com/NVIDIA/TensorRT-LLM/pull/13343
- [None][chore] AutoDeploy: Refactor finegrained FP8 scale sharding helpers by @galagam in https://github.com/NVIDIA/TensorRT-LLM/pull/12999
- [https://nvbugs/6076564][fix] unwaive TestNemotronH::test_auto_dtype[trtllm-flashinfer_ssm-False] by @tcherckez-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/13187
- [TRTLLM-10061][feat] Prefix caching support for mamba hybrid models by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/12185
- [None][cleanup] remove legacy addSequence path by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/13280
- [None][infra] Waive 1 failed cases for main in pre-merge 35790 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/13483
- [None][fix] Fix bugs in WindowBlockManager destructor statistics by @eopXD in https://github.com/NVIDIA/TensorRT-LLM/pull/12448
- [None][chore] Update CI allowlist 2026-04-23 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/13381
- [None][fix] Consolidate aiohttp session management in disagg router…