NVIDIA/TensorRT-LLM v1.3.0rc13
Repository: NVIDIA/TensorRT-LLM
Tag: v1.3.0rc13
Published: 2026-04-29T05:47:54Z
Prerelease: yes
Release notes:
Highlights
- Model Support
  - Support and initial optimizations for Nemotron 3 Nano Omni; known issues for audio-from-video and chunked prefill for video being actively worked on
  - Add audio extraction from video, optimize ViT attention, and reduce initialization memory for Nemotron and Nemotron Nano VL models (#12921, #12911, #13283)
  - Add per-model VisualGen example scripts, shared configs, per-model defaults, and metadata updates (#12992, #12862)
  - Add GLM-4.7 and GLM-5 tool parser support (#13150)
  - Optimize Nemotron-H execution from the Python layer and preserve Nemotron HF mamba cache dtype during bench tuning (#13032, #12826)
  - Improve DeepSeek-V3.2 and DeepSeek-V3-Lite support with targeted perf and chunked-prefill fixes on Blackwell and SM100-class GPUs (#13142, #13257)
- API
  - Fix the chunked prefill API contract for Nemotron Nano VL (#13025)
  - Add abort and resume support for Async RL in verl (#12272)
  - Add a modular logger with automatic module detection and per-module filtering (#13202)
  - Improve prompt handling by accounting for existing multimodal placeholder tokens in text prompts (#12827)
  - Propagate real server-side failures to disaggregated serving clients and improve empty-file handling in trtllm-bench (#13119, #12552)
- Feature
  - Add VisualGen Cache-DiT and a unified cache accelerator (#12548)
  - Expand kernel support with broader RMSNorm coverage, optimized causal-conv1d prefill and decode, FP4 residual quantization, and refreshed SageAttention kernels (#13033, #13103, #13117, #12937)
  - Add batched addSequence with two-phase claim and unified VSWA and non-reuse support (#13029)
  - Add sparse MQA and GQA attention support and introduce new sharding infrastructure (#12470, #12419)
  - Improve serving performance with async media loading, faster video frame decoding, cached text computation reuse, lower custom-op overhead, padding-aware CUDA graph tuning, and reduced single-rank broadcast overhead (#13034, #12677, #13149, #12895, #13412, #13259, #11640)
  - Optimize runtime internals with Minimax RMSNorm tuning, consolidated prefix-reuse analysis, gen-only sync transfer v2, DWDP contention config cleanup, and round-robin CP cache transmission (#12163, #13095, #12882, #12974, #13180)
  - Restore EAGLE3 dynamic-tree speculative decoding support and centralize perfect-router integration and validation (#13081, #13250)
- Fix
  - Fix KV cache and scheduler correctness issues, including SWA compatibility, token accounting with context chunking, over-allocation in VSWA plus EAGLE flows, KVCacheManagerV2 bugs, and multimodal and disaggregated cache reuse problems (#12968, #12976, #12855, #12306, #13104, #12472)
  - Fix runtime stability issues by preventing benchmark fill-loop hangs, tightening warmup reservation behavior, and making host-memory-based prefetch decisions consistent across ranks (#13065, #13078, #13161)
  - Fix EAGLE3 LoRA speculative decoding and preserve speculative layer weights to avoid MTP plus PP hangs (#13005, #12555)
  - Fix FMHA and attention runtime issues, including SM90 full-mask skip-softmax dispatch, misleading generation warnings, stale CUDA graphs on beam-width changes, and FlashInfer KV layout handling (#13120, #13157, #13255, #13190)
  - Fix vision and multimodal correctness issues, including KV-cache quantization leaks into the vision encoder, FLUX high-resolution scheduler off-by-one behavior, and Super V3 multi-stream MoE instability (#13181, #13091, #13122)
  - Fix packaging and environment issues by restoring the missing aarch64 library, enforcing NCCL >= 2.28 at configure time, and using weights_only=True in LoRA manager loads (#13206, #13108, #13391)
  - Fix operational reliability issues in CI and perf pipelines, including OpenSearch upload failures, hanging AIPerf metrics, SLURM host name propagation, and SLURM submission retry behavior (#13215, #13314, #13367, #12778)
  - Fix additional model and runtime issues for Qwen3 mrope cache handling, DSA illegal memory access with CUDA graph plus host KV offload, stale tokenizer alias imports, and WAN example timing conflicts (#13269, #13124, #13086, #13193, #12128)
- Documentation
  - Restructure installation documentation and refresh verbose comments (#12402, #13387)
  - Update invalid Dynamo documentation URLs (#13038)
- Test & Infra
  - Add Dynamo API compatibility tests, VisualGen regression coverage, and refactor MoE communication tests (#12970, #13372, #12841)
  - Expand CI coverage for disaggregated serving and weekly performance suites, including K2.5 EPLB coverage, refreshed Nemotron datasets, and additional weekly perf models (#13185, #12982, #13325)
  - Improve CI signal quality by splitting multimodal DGX_B200 jobs, removing obsolete or low-priority cases, dropping non-key-model L0 coverage, and moving bf16 and auto precision variants to post-merge (#12978, #13262, #13374, #13315, #13366)
  - Improve CI tooling with PR-aware failure analysis, SwiftStack upload support, wildcard bot stage commands, a sync_qa_tests Jenkins script, doc tests, and markdown-only doc-build rules (#12849, #13291, #12881, #13028, #13152, #13358, #13441)
  - Refresh repository ownership and security plumbing with CODEOWNERS updates, HMAC key enforcement, and container vulnerability fixes (#13110, #13213, #9850, #13447)
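The API highlight about a modular logger with per-module filtering (#13202) does not describe the interface, so the following is only a rough sketch of the general idea using Python's standard `logging` module. It is not TensorRT-LLM's logger API. The helper names `get_module_logger` and `configure_per_module_levels` and the `demo.*` module names are hypothetical, and the module name is passed explicitly here rather than detected automatically.

```python
import io
import logging


def get_module_logger(module_name: str) -> logging.Logger:
    """Return a logger named after a module (callers would typically pass __name__)."""
    return logging.getLogger(module_name)


def configure_per_module_levels(levels: dict, stream) -> logging.Handler:
    """Attach one handler to the root logger and set a severity threshold per module."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s:%(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.WARNING)  # default threshold for modules not listed below
    for name, level in levels.items():
        logging.getLogger(name).setLevel(level)  # per-module override
    return handler


# Hypothetical module names, used only for illustration.
buffer = io.StringIO()
configure_per_module_levels(
    {"demo.kv_cache": logging.DEBUG, "demo.scheduler": logging.ERROR},
    buffer,
)

kv_log = get_module_logger("demo.kv_cache")
sched_log = get_module_logger("demo.scheduler")

kv_log.debug("allocated 4 blocks")   # kept: demo.kv_cache is set to DEBUG
sched_log.warning("queue is long")   # dropped: demo.scheduler only passes ERROR and above
sched_log.error("request dropped")   # kept

captured = buffer.getvalue().splitlines()
```

Because each logger is keyed by module name, verbosity can be raised for one subsystem while the rest stay quiet. Whether the logger added in #13202 works this way is an assumption.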
What's Changed
- [https://nvbugs/5997092][fix] Remove waives for DS-V3.2/R1 FP4 Blackwell perf tests by @peihu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/13042
- [None][infra] Waive 2 failed cases for main in post-merge by @xinhe-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/13105
- [TRTLLM-9132][infra] Update to ignore failure for release check and building images by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/9871
- [https://nvbugs/5626259][fix] Enable nemotron-h chunk prefill test by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/12980
- [None][feat] Add the invocation path for mamba2 mtp custom op by @JadoTu in https://github.com/NVIDIA/TensorRT-LLM/pull/12787
- [None][infra] Waive 4 failed cases for main in post-merge 2654 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/13113
- [None][infra] Waive 3 failed cases for main in post-merge 2658 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/13141
- [None][chore] Add CODEOWNERS mappings for @NVIDIA/trt-llm-multimodal-devs by @venkywonka in…