Release · NVIDIA · published Jun 10, 2026

NVIDIA/TensorRT-LLM v1.3.0rc18

v1.3.0rc18

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc18

Published: 2026-06-10T00:10:37Z

Prerelease: yes

Release notes:

  • Known Issues
      • DSV3.2 (DeepSeek-V3.2) crashes with an illegal memory access (IMA) in various long-running perf tests on GB200/GB300 when the CuteDSL MoE backend is used. To work around this issue, use another MoE backend.
  • Model Support
      • Support Nemotron-H NVFP4 checkpoint on Hopper (#14775)
      • Add Qwen image support (#13449)
      • Support Step-3.7-Flash model (#14711)
      • Add Cosmos3-Nano and Cosmos3-Super support (#14824)
      • Add AFMoE Trinity support (#13148)
  • API
      • Add logprobs_simple_format option to return logprobs as a flat list[float] (#13972)
      • trtllm-serve, trtllm-eval, trtllm-bench: Make CLI flags take precedence over --config / --extra_llm_api_options YAML (#14812)
  • Feature
      • Upgrade NIXL to v1.0.1 and UCX to 1.21 (#14436)
      • Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL (#14453)
      • Enable FlashInfer GDN decoding kernel for Qwen3.5 (#13645)
      • Add per-expert LoRA support with Cutlass backend (#14801)
      • Reduce OpenAI stream postprocess overhead (#14708)
      • Add encoder CUDA graph support to llm.encode() (#14326)
      • Use a Triton kernel for C++ Mamba hybrid state update (#14869)
      • Fuse masked gather + finalize-scale into one Triton kernel in DeepGemmFusedMoE (#14592)
      • Support KVCacheManagerV2 adjust() in single GPU + agg PyExecutor loop (#14578)
      • Add disk cache config for KVCacheManagerV2 (#14845)
      • Add Wan I2V generation example (#14981)
      • Add LTX-2 visual generation example (#14976)
      • Update flashinfer-python from 0.6.12rc2 to 0.6.12 (#14805)
  • Fix
      • Fix mamba-out-of-block error with ADP + BS=1 + disagg (#14853)
      • Fix XQA IMA for invalid pages with sliding window (#14459)
      • Propagate event loop errors to await_responses callers (#12735)
      • Fix Mamba replay mode accuracy issues (#14509)
      • Fix PyExecutor hang in disagg TP prefill (#14020)
      • Fix stale runtime metadata issues during MLA fallback transitions (#14049)
      • Fix KVCacheManagerV2 block counting correctness issues (#14725)
      • Canonicalize multimodal cache-key serialization to prevent hash collisions (#14800)
      • Fix LTX-2 audio PE padding issues (#14818)
      • Release KVCacheManagerV1 blocks on MAX_UTILIZATION pause (#14723)
      • Fix config sharing issue for Qwen3-VL (#14766)
      • Enforce request and buffer index lifecycle integrity (#14768)
      • Add nemotron-v3 as the proper nemotron-h reasoning parser (#14900)
      • Clamp KV pool window sizes to max_seq_len (#14905)
      • Fix Mamba block calculation (#14524)
      • Add trust_remote_code=True to the LLM(...) constructor to fix various model loading issues (#14892)
      • Fix deep EP partial warp sync for GPT-OSS shapes (#14977)
      • Add warmup for trtllm-gen fmha JIT kernels (#14851)
  • Documentation
      • Add VisualGen API walkthrough example and docs page (#14685)
      • Add Nemotron 3 Ultra doc (#14964, #15113)
  • Test & Infra
      • Pipe stderr separately in subprocess calls to improve error reporting in Allure (#14750)
      • Remove obsolete tests (#14995, #14660, #14992, #14952, #14749)
      • Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis (#14528)
      • Relocate tests to right-sized stages (#14684)
      • Move non-default-feature tests to post merge (#15038)
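
The logprobs_simple_format item (#13972) states only that logprobs are returned as a flat list[float]. As a rough illustration of what such a flattening could look like, the sketch below assumes the non-simple shape is one dict of candidate token IDs per generated position. The Logprob class, the to_simple_format helper, and the choice to keep only the sampled token's value are hypothetical stand-ins, not the library's code.

```python
from dataclasses import dataclass


@dataclass
class Logprob:
    """Hypothetical stand-in for a per-candidate logprob record."""
    logprob: float
    rank: int


def to_simple_format(token_ids: list[int],
                     per_position: list[dict[int, Logprob]]) -> list[float]:
    """Keep only the sampled token's logprob at each position (assumed behavior)."""
    return [candidates[token_id].logprob
            for token_id, candidates in zip(token_ids, per_position)]


# Two generated tokens; the first position also carries an unsampled candidate.
detailed = [
    {5: Logprob(-0.1, 1), 7: Logprob(-2.5, 2)},
    {9: Logprob(-0.7, 1)},
]
flat = to_simple_format([5, 9], detailed)
print(flat)
```

A flat list is cheaper to serialize and easier to feed into downstream scoring code, at the cost of dropping the alternative candidates and their ranks.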
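
The CLI precedence change (#14812) can be pictured as a three-layer merge: built-in defaults, then values from the YAML file, then flags the user typed explicitly. The sketch below is a minimal model of that ordering under stated assumptions; resolve_options and the option names are illustrative, and treating None as "flag not passed" is a convention chosen for this example rather than something the release notes describe.

```python
from typing import Any, Optional


def resolve_options(defaults: dict[str, Any],
                    yaml_options: dict[str, Any],
                    cli_flags: dict[str, Optional[Any]]) -> dict[str, Any]:
    """Merge settings so explicit CLI flags beat YAML, and YAML beats defaults."""
    resolved = dict(defaults)
    resolved.update(yaml_options)
    # None marks a flag the user did not pass, so it must not override YAML.
    resolved.update({k: v for k, v in cli_flags.items() if v is not None})
    return resolved


# Illustrative option names only.
defaults = {"tp_size": 1, "max_batch_size": 8, "backend": "pytorch"}
yaml_options = {"tp_size": 4, "max_batch_size": 64}
cli_flags = {"tp_size": 2, "max_batch_size": None}

resolved = resolve_options(defaults, yaml_options, cli_flags)
print(resolved)
```

Here tp_size comes from the command line, max_batch_size from the YAML file, and backend from the defaults.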
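
The multimodal cache-key fix (#14800) mentions canonicalizing serialization to prevent hash collisions. The release notes do not describe the actual scheme, so the following is only a generic sketch of the idea: hashing a serialization that loses field boundaries can map different inputs to the same key, while a canonical form (sorted keys, fixed separators) avoids that. The naive_key and canonical_key helpers and the field names are hypothetical.

```python
import hashlib
import json


def naive_key(fields: list[str]) -> str:
    # Concatenating fields without delimiters loses the field boundaries.
    return hashlib.sha256("".join(fields).encode("utf-8")).hexdigest()


def canonical_key(item: dict) -> str:
    # Sorted keys and fixed separators give one byte string per logical value.
    payload = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two different inputs collide under the naive scheme.
collides = naive_key(["ab", "c"]) == naive_key(["a", "bc"])

# The canonical form keeps them apart and ignores dict insertion order.
distinct = canonical_key({"modality": "ab", "id": "c"}) != canonical_key(
    {"modality": "a", "id": "bc"})
order_free = canonical_key({"a": 1, "b": 2}) == canonical_key({"b": 2, "a": 1})
print(collides, distinct, order_free)
```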

What's Changed

  • [None][test] Update datasets path by @JennyLiu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14671
  • [None][infra] Update new .test_durations by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14661
  • [TRTLLM-13015][feat] drop complex visual_gen CLI example scripts by @zhenhuaw-me in https://github.com/NVIDIA/TensorRT-LLM/pull/14632
  • [https://nvbugs/6117811][fix] Fix XQA IMA for invalid pages with sliding window by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14459
  • [None][feat] Tune mamba config by env variables by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/14730
  • [None][test] Update moe backend for ctx and acceptance length env by @fredricz-20070104 in https://github.com/NVIDIA/TensorRT-LLM/pull/14803
  • [None][test] Update precision of previous device step time by @fredricz-20070104 in https://github.com/NVIDIA/TensorRT-LLM/pull/14809
  • [None][infra] Waive 12 failed cases for main in post-merge 2749 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14802
  • [TRTLLM-12971][infra] Fix parse classname logic in timeout result by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/14559
  • [https://nvbugs/6038228][fix] Propagate event loop errors to await_responses callers by @JunyiXu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/12735
  • [TRTLLM-12288][feat] Support Nemotron-H nvfp4 ckpt on Hopper by @JadoTu in https://github.com/NVIDIA/TensorRT-LLM/pull/14775
  • [TRTLLM-12596][feat] Support simple logprob format by @tongyuantongyu in https://github.com/NVIDIA/TensorRT-LLM/pull/13972
  • [None][fix] Stabilize Mamba replay state update by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/14509
  • [None][feat] Upgrade NIXL to v1.0.1 and UCX to 1.21 by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/14436
  • [None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA by @tianyuz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14453
  • [TRTLLM-10947][perf] eagle3: use cudaMemcpy2DAsync custom op for hidden-state capture by @pcicotti in https://github.com/NVIDIA/TensorRT-LLM/pull/14479
  • [None][fix] PyExecutor Hang in Disagg TP Prefill by @jthomson04 in https://github.com/NVIDIA/TensorRT-LLM/pull/14020
  • [https://nvbugs/6240561][fix] Autodeploy fix the deepseek accuracy drop by @nvchenghaoz in https://github.com/NVIDIA/TensorRT-LLM/pull/14774
  • [#12702][feat] Autodeploy deprecate the legacy triton attention by @nvchenghaoz in https://github.com/NVIDIA/TensorRT-LLM/pull/14194
  • [None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14789
  • [None][test] Waive 7 failed cases for main in QA CI by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14791
  • [https://nvbugs/6240561][fix] Fix AutoDeploy DeepSeek-R1 accuracy drop by @taylor-yb-lee in https://github.com/NVIDIA/TensorRT-LLM/pull/14793
  • [#14588][fix] [AutoDeploy] Fix OOM of DeepSeek-R1 NVFP4 for tp=4 by @taylor-yb-lee in https://github.com/NVIDIA/TensorRT-LLM/pull/14477
  • [https://nvbugs/6179761][fix] Save LTX-2 BF16 weights to speed up perf by @yibinl-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/14639
  • [TRTLLM-13028][doc] Add VisualGen API walkthrough example and docs page by…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine release candidate of an optimization library.