Release | NVIDIA | published May 21, 2026

NVIDIA/TensorRT-LLM v1.3.0rc15

v1.3.0rc15

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc15

Published: 2026-05-21T14:27:58Z

Prerelease: yes

Release notes:

Highlights

Model Support

  • Add Gemma4 multimodal model support with text, vision, audio, and chunked prefill capabilities (#12932, #14134)
  • Add Kimi K2.5 multimodal vision support and reasoning parser integration (#12788, #13801)
  • Add GPT-OSS, Ministral3, Nemotron-H, Nemotron Nano, and DeepSeek model enablement and compatibility updates (#12743, #12884, #13844, #13977)
  • Improve DeepSeek V4 and DeepSeek V3.2 support with new attention kernels, routing updates, tokenizer loading, and AutoConfig registration (#13652, #13186, #14261, #14293)

API

  • Add a typed exception hierarchy, shared classifier, retry-consumer migration, and typed Slurm infra failures (#13732, #13780, #13863, #13809, #14147)
  • Add VisualGen public output APIs, serving batch inference, and benchmark timing decomposition (#13635, #12350)
  • Add per-request media_io_kwargs support for chat completions (#13779)
  • Add per-rank iteration statistics and Attention-DP metrics to serving endpoints (#13221, #13649)
  • Add cache_salt_id support to the KV cache v2 manager (#13793)
  • Limit requested sampling logprobs as a breaking API change (#13520)
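
The typed exception hierarchy, shared classifier, and retry-consumer migration listed above can be pictured with a small sketch. Every name here (`InferenceError`, `TransientError`, `FatalError`, `classify`, `run_with_retry`) is hypothetical and chosen for illustration; none of them is taken from the TensorRT-LLM code base. The sketch only shows the general pattern: one classifier decides whether a failure is retryable, and retry logic consults it instead of matching on error strings.

```python
import time


class InferenceError(Exception):
    """Hypothetical root of a typed exception hierarchy."""


class TransientError(InferenceError):
    """Failure that may succeed on retry (for example, a busy node)."""


class FatalError(InferenceError):
    """Failure that retrying cannot fix (for example, a bad config)."""


def classify(exc):
    """Shared classifier: map any exception to a class in the hierarchy."""
    if isinstance(exc, InferenceError):
        return type(exc)
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return TransientError
    return FatalError


def run_with_retry(fn, max_attempts=3, base_delay=0.0):
    """Call fn, retrying only while the classifier reports a transient failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            retryable = issubclass(classify(exc), TransientError)
            if not retryable or attempt == max_attempts:
                raise
            # Exponential backoff between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

With this shape, a consumer that used to retry on any exception retries only on what the classifier marks transient, and a fatal error surfaces on the first attempt.
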

Feature

  • Improve MoE and fused-kernel performance with MegaMoE DeepGEMM, CUTEDSL MoE, shared-expert SwiGLU quantization, GDN fusion, bf16 FlashInfer MoE, and refreshed MoE cubins (#13384, #12884, #11897, #12966, #13689, #12440)
  • Add FP4 and FP8 decode kernels, FP4 DSA indexing, DeepSeek V4 attention kernels, FMHA head_dim 80 cubins, and multi-K and multi-dtype GVR Top-K support (#13929, #13219, #13340, #13652, #13808, #13948)
  • Improve VisualGen and diffusion pipelines with SageAttention for Wan/FLUX, fused cross-head QK Norm plus RoPE for Wan, LTX2 refactoring, and parallel VAE scaling (#13570, #13052, #13285, #13873)
  • Improve KV reuse, disaggregated serving, and transfer paths with transceiver v2 KV reuse, multi-threaded KV transfer, internal TRTLLM-Gen routing, additional conversation headers, and LoRA request-broadcast reduction (#13115, #13075, #13997, #13656, #12959)
  • Improve speculative decoding and hybrid-model execution with fractional synthetic acceptance rates, MTP block reuse, EAGLE3 rejection sampling, MTP max_draft_len decoupling, and mamba SSD prefill optimizations (#13569, #12896, #12588, #12341, #12731)
  • Improve performance tooling and runtime throughput with DFlash optimizations, host-profiler utilities, batch-full benchmark metrics, model-init NVLink caching, scheduling overhead reductions, beam-search overlap scheduling, and FC2 DenseGEMM autotuning (#13996, #11741, #13638, #14070, #13843, #14061, #13833)
  • Add CMake third-party cache support for clean builds (#13942)
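
For readers unfamiliar with the rejection sampling mentioned in the speculative decoding item above, here is a minimal sketch of the textbook speculative-sampling acceptance step. It is not the EAGLE3 implementation from this release, and the function name `accept_or_resample` and its arguments are assumptions made for illustration. A draft token is accepted with probability min(1, p/q), where p and q are the target and draft probabilities. On rejection, a replacement is drawn from the renormalized residual max(p − q, 0).

```python
import random


def accept_or_resample(draft_token, p_target, q_draft, rng=random):
    """One acceptance step of standard speculative rejection sampling.

    p_target and q_draft are probability lists over the same vocabulary.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the draft token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual distribution max(p - q, 0).
    residual = [max(pt - qd, 0.0) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Identical distributions leave no residual; use the target itself.
        residual = list(p_target)
    token = rng.choices(range(len(residual)), weights=residual)[0]
    return token, False
```

When the draft and target distributions agree on a token, the draft is always accepted. The further the draft overestimates a token, the more often it is rejected.
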

Fix

  • Fix CUDA graph, profiling, and scheduling correctness issues, including YAML CudaGraphConfig validation, profiler scoping, piecewise capture, EAGLE3 hidden-state reuse, and guided decoding GIL handling (#13397, #12432, #13574, #13920, #13251)
  • Fix KV cache and scheduler behavior for FlashMLA token block overrides, mamba slot memory, delayed batching page release, adaptive ratio sampling, zero-layer mamba ranks, stale Scheduler V2 state, stale attention metadata, and chunked prefill EVS merging (#13752, #13489, #13805, #13857, #13999, #13592, #13696, #13754)
  • Fix model loading and quantization issues for GPT-OSS MXFP4, dummy weights, Mixtral modelopt export, DeepSeek V3 Lite FP8 MTP weights, composite HF configs, GLM-5 router GEMM, INT4 AWQ on SM120/121, and Qwen3 FP4 CUTLASS MoE OOM (#13708, #13879, #14179, #12530, #14068, #13740, #11561, #13349)
  • Fix serving and benchmark clients with hardened media URL loading, split SSE chunk parsing, aiohttp 3.13 streaming handling, /metrics tee-buffer serving, bounded gRPC payloads, router tokenizer skipping, unset attention_dp_relax handling, and clear GPT-OSS backend errors (#12748, #13686, #13952, #13405, #13519, #14030, #14276, #13166)
  • Fix distributed and disaggregated runtime stability for mamba disaggregation, worker preparation, PP executor shutdown, SM120 all-reduce launch, guided-decoding PP warmup barriers, Torch process-group teardown, Triton MoE memory freeing, and GB300 UCX settings (#13274, #13755, #13267, #13169, #13132, #12993, #14069, #14168)
  • Fix accuracy and memory regressions in DeepSeek, Nemotron, Qwen3, MTP, beam search, FMHA workspace sizing, and FP8 block-scaling autotuner cache growth (#13924, #13968, #13782, #14063, #13799, #13880, #14165)
  • Fix package, license, and compliance issues in llm-c standalone generation, SPDX headers, OSS headers, diffusers pinning, and broken documentation URLs (#14011, #14106, #14193, #14281, #13242, #13422)
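
The split SSE chunk parsing fix in the serving and benchmark client item above addresses a common streaming pitfall: a single server-sent event can arrive spread over several network chunks, so a client that parses each chunk on its own will drop or corrupt events. The sketch below shows one generic way to handle that by buffering until the blank-line delimiter arrives. It is not the code from the referenced PR, and the class name `SSEBuffer` is hypothetical. It assumes events are terminated by `\n\n` or `\r\n\r\n` and reads only `data:` fields.

```python
class SSEBuffer:
    """Accumulates raw bytes and returns complete SSE data payloads."""

    def __init__(self):
        self._buf = b""

    def feed(self, chunk):
        """Add one network chunk; return payloads of events completed so far."""
        # Normalize on the whole buffer so a CRLF split across chunks is handled.
        self._buf = (self._buf + chunk).replace(b"\r\n", b"\n")
        events = []
        while b"\n\n" in self._buf:
            raw, self._buf = self._buf.split(b"\n\n", 1)
            data_lines = []
            for line in raw.split(b"\n"):
                if line.startswith(b"data:"):
                    value = line[5:]
                    # The SSE format strips one optional leading space.
                    if value.startswith(b" "):
                        value = value[1:]
                    data_lines.append(value)
            if data_lines:
                # Decode only complete events, so split UTF-8 bytes are safe.
                events.append(b"\n".join(data_lines).decode("utf-8"))
        return events


buf = SSEBuffer()
print(buf.feed(b'data: {"tok'))                      # incomplete event
print(buf.feed(b'en": "hi"}\n\ndata: [DO'))          # first event completes
print(buf.feed(b"NE]\n\n"))                          # second event completes
```
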

Documentation

  • Add and update technical blogs for Helix Parallelism, Scaffolding, Gemma4, MoE as Dense GEMM on Blackwell, and VisualGen-related content (#13547, #11841, #13947, #13834, #14171)
  • Add DFlash quickstart updates, custom PyTorch backend kernel integration guidance, Gemma4 usage examples, spec-decoding support matrices, and layer-wise benchmark doc fixes (#13545, #13917, #14303, #14195, #13979)
  • Refresh image links and broken URLs in documentation and blog content (#13838, #13422)

Test & Infra

  • Add model and multimodal coverage for Wan 2.2 TI2V, nano v3 omni audio and video, Nemotron Ultra V3, Gemma4 CUDA graph registration, and W4A8_MXFP4_FP8 MoE unit tests (#13739, #13616, #13750, #13883, #13658, #14082, #13401)
  • Add and refresh performance coverage for VisualGen sanity, GB300 disaggregated NIXL, DSR1 disaggregated tests, trtllm-bench metrics, and Kimi K2.5 FP4 RCCA tests (#13144, #13594, #13882, #14178, #14172)
  • Improve change-based testing, CI triggers, GitHub checks, stage splitting, rerun handling, and LFS synchronization (#13382, #13899, #13993, #14022, #14064, #14035, #12406, #13826)
  • Improve build, dependency, and package infrastructure with FlashInfer updates, Transformers 5.x upgrades, compressed cubin archives, SBSA wheel image support, license scanning, and llm-c artifact cleanup (#13746, #13992, #14076, #12829, #13994, #13542, #12635, #13921, #13272)
  • Improve CI coverage organization by moving chunked-prefill cases, splitting long hardware-agnostic tests, adding feature-contract keys, and promoting DeepSeek-V4-Flash to the MoE CI subset (#14083, #13751, #13756, #13933, #13964)
  • Improve developer and CI operations with blossom-ci allowlist updates, skills naming enforcement, pre-commit validation,…
