Release: NVIDIA, published May 7, 2026

NVIDIA/TensorRT-LLM v1.3.0rc14

v1.3.0rc14

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc14

Published: 2026-05-07T05:55:19Z

Prerelease: yes

Release notes:

Highlights

  • Model Support
      • Add prefix caching for Mamba hybrid models including Qwen3.5 and Nemotron Super V3 (#12185)
      • Improve Qwen3.5 support with custom MoE routing and dense and NVFP4 weight loading fixes (#13433, #13090, #13716)
      • Improve Nemotron and Nemotron Nano support with GEMM tuning and multimodal placeholder expansion (#13160, #13069)
      • Add Wan 2.2 5B TI2V support and refine LTX-2 FP4 stage handling (#13256, #13244)
  • API
      • Embed VisualGenParams in DiffusionRequest and simplify generate() inputs (#13313)
      • Add llm.encode() fast path support for encoder-only models (#12801)
      • Add per-iteration request-aggregate counters to InflightBatchingStats (#13199)
      • Add ASGI middleware support for Serve (#13378)
      • Introduce cancellation support in transceiver v2 (#12734)
      • Fix Triton backend generation parameter handling for promptIgnoreLength, lengthPenalty, earlyStopping, and early_stopping (#13633, #13692)
  • Feature
      • Improve VisualGen serving with fast PNG compression, multi-node diffusion workers, non-contiguous multimodal chunked prefill, and Attention2D sequence parallelism (#13074, #13140, #12944, #12943)
      • Improve disaggregated serving and routing with gen-first ADP serving, KV-aware hit-rate gates and fair-share caps, and consolidated aiohttp session handling (#13112, #13198, #13408)
      • Expand kernel and runtime performance with GEMM-to-allreduce registered buffers, CuteDSL bf16 dense GEMMs, sparse-attention GVR Top-K dispatchers, fused add-norm-FP8 quantization, TF32 DSA GEMMs, sampler optimizations, and leaner MPI collectives (#11589, #12074, #13477, #12674, #13452, #13480, #13380, #13089)
      • Improve speculative decoding with DFlash one-model support, Mamba-2 rollback replay, radix-based SWA cleanup, and trtllm-gen routing refactoring (#12794, #13453, #13346, #13328)
      • Support NVFP4 weight updates (#12320)
      • Add per-rank torch profile traces for distributed profiling (#13536)
  • Fix
      • Fix KV cache and scheduler correctness issues, including WindowBlockManager statistics, Mamba cache handling under MTP with CUDA graph padding, free-block counter corruption, V2 extra_tokens accounting, PEFT page accumulation, and temporary attention-window cleanup (#12448, #13151, #12834, #13619, #13709, #13528, #12450)
      • Fix disaggregated serving and worker reliability by resolving aggregate PP4 hangs, preventing zombie worker pods, and correcting cached-token usage accounting (#12888, #12718, #13620)
      • Fix OpenAI and Triton generation flows for None tokenizers, prompt ignore lengths, early stopping, and terminateRequest handling from background logits threads (#13184, #13633, #13692, #13059)
      • Fix attention and VisualGen runtime issues, including UlyssesAttention sequence lengths, Ulysses plus Sage execution, TRTLLM-Gen GmemReduction illegal memory access, and low-memory Qwen3 skip-softmax behavior (#13486, #13440, #13541, #13581)
      • Fix distributed runtime stability with corrected pipeline-parallel layer distribution, reduced host-memory regression in speculative decoding, and MoE communication fallback after init exceptions (#13066, #13130, #13331)
      • Fix cache memory estimation for Qwen3 hybrid models in trtllm-bench and lower Eagle3 one-model acceptance thresholds for H20 (#13268, #13565)
  • Documentation
      • Add batch-size tuning guidance for CUDA graph padding and a GVR Top-K technical blog (#13393, #13714)
      • Remove outdated news items and clean up llmc licensing documentation (#13603, #13700)
  • Test & Infra
      • Add and refresh coverage for disaggregated post-merge performance, GPT-OSS 20B MHA, prefix-aware scheduling, cascade-prune repros, and issue-specific regressions (#13343, #12796, #13578, #13572, #13553)
      • Improve CI triage and failure analysis with Perf Triage Bot integration, rendered HTML failure reports, K8s infrastructure retry, PR base freshness checks, static test validation, and clearer Slurm pending logs (#12429, #13526, #13530, #13430, #13423, #13586)
      • Improve CI and build stability with lower test memory pressure, adjusted DeepEP token limits, CUDA line info defaults, Debug CUDA flag fixes, module-level skips, and longer FMHA timeouts (#13402, #13484, #13334, #13598, #13223, #12860)
      • Refresh test organization and dependencies with post-merge test moves, updated constraints, FlashInfer Python updates, B200 multimodal unit-test deduplication, and sorted waive enforcement (#13376, #13482, #13064, #13631, #13584, #12672)
      • Improve distributed and QA infrastructure with free-port FLUX/WAN test initialization, multinode fallback handling, NIXL-based perf sanity tests, QA popen workarounds, and KVCacheManager connector helper fixes (#13364, #13537, #13654, #13634, #13749)
      • Improve package and release infrastructure with llmc standalone package cleanup, release-scanning PLC nightly adjustments, devel-stage apt cache mounts, and pip cache reuse (#13466, #13694, #13245, #13510)
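
The Serve middleware item under API (#13378) most likely refers to ASGI, the asynchronous server interface used by Python web frameworks. These notes do not show how Serve registers a middleware, so the sketch below only illustrates the general shape of a pure ASGI middleware. The names RequestIdMiddleware, hello_app, and call_once are hypothetical, and hello_app is a stand-in application, not the TensorRT-LLM server.

```python
import asyncio


class RequestIdMiddleware:
    """Pure ASGI middleware that appends an x-request-id response header.

    Hypothetical example class; it does not come from TensorRT-LLM.
    """

    def __init__(self, app, header_value="demo"):
        self.app = app
        self.header_value = header_value.encode()

    async def __call__(self, scope, receive, send):
        # Pass non-HTTP traffic (lifespan, websocket) through untouched.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_header(message):
            # Only the response-start message carries the headers.
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-request-id", self.header_value))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_header)


async def hello_app(scope, receive, send):
    """Stand-in ASGI application that always answers 200 with body b'ok'."""
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"ok"})


async def call_once(app):
    """Drive one GET request through `app` and collect the sent messages."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/", "headers": []}
    await app(scope, receive, send)
    return sent


messages = asyncio.run(call_once(RequestIdMiddleware(hello_app, "abc123")))
```

A middleware written this way wraps any ASGI callable, so it can be exercised in isolation as above before it is attached to a real server. How it is attached to Serve should be taken from the pull request itself.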

What's Changed

  • [https://nvbugs/6093714][fix] Reduce batch size and add memory guard for test by @govind-ramnarayan in https://github.com/NVIDIA/TensorRT-LLM/pull/13402
  • [TRTLLM-11373][refactor] Embed VisualGenParams in DiffusionRequest and simplify generate() inputs by @zhenhuaw-me in https://github.com/NVIDIA/TensorRT-LLM/pull/13313
  • [None][test] Update CI Post-Merge Disagg Perf Tests by @chenfeiz0326 in https://github.com/NVIDIA/TensorRT-LLM/pull/13343
  • [None][chore] AutoDeploy: Refactor finegrained FP8 scale sharding helpers by @galagam in https://github.com/NVIDIA/TensorRT-LLM/pull/12999
  • [https://nvbugs/6076564][fix] unwaive TestNemotronH::test_auto_dtype[trtllm-flashinfer_ssm-False] by @tcherckez-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/13187
  • [TRTLLM-10061][feat] Prefix caching support for mamba hybrid models by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/12185
  • [None][cleanup] remove legacy addSequence path by @liji-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/13280
  • [None][infra] Waive 1 failed cases for main in pre-merge 35790 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/13483
  • [None][fix] Fix bugs in WindowBlockManager destructor statistics by @eopXD in https://github.com/NVIDIA/TensorRT-LLM/pull/12448
  • [None][chore] Update CI allowlist 2026-04-23 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/13381
  • [None][fix] Consolidate aiohttp session management in disagg router…
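
The complete entries above follow one pattern: a ticket tag, a change type, a title, an author handle, and a pull request URL. For readers who script over these notes, here is a minimal sketch of a parser for that pattern. ENTRY_RE and parse_entry are hypothetical names, and the pattern is inferred only from the entries visible here, so other entries in the full release may not match it.

```python
import re

# Pattern inferred from the visible entries:
#   [ticket][type] title by @author in https://github.com/<org>/<repo>/pull/<n>
ENTRY_RE = re.compile(
    r"^\[(?P<ticket>[^\]]+)\]\[(?P<kind>[^\]]+)\]\s+"
    r"(?P<title>.+?)\s+by\s+@(?P<author>[\w-]+)\s+in\s+"
    r"(?P<url>https://github\.com/\S+/pull/(?P<pr>\d+))$"
)


def parse_entry(line):
    """Return the fields of one entry as a dict, or None if it does not match."""
    text = line.strip().lstrip("•").strip()  # drop indentation and the bullet
    match = ENTRY_RE.match(text)
    return match.groupdict() if match else None


entry = parse_entry(
    "  • [TRTLLM-10061][feat] Prefix caching support for mamba hybrid models "
    "by @VALLIS-NERIA in https://github.com/NVIDIA/TensorRT-LLM/pull/12185"
)

# A truncated entry has no author or URL, so it is reported as None
# rather than being completed by guesswork.
truncated = parse_entry(
    "  • [None][fix] Consolidate aiohttp session management in disagg router…"
)
```

Returning None for lines that do not match keeps truncated or irregular entries visible to the caller instead of silently dropping or misreading them.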

Excerpt shown — open the source for the full document.
