What does this release signal mean?

NVIDIA published v1.3.0rc18 of NVIDIA/TensorRT-LLM, its library for optimized LLM inference on GPUs, as a prerelease on 2026-06-10T00:10:37Z. A release signal is evidence of what shipped, changed, or was packaged for users. The release notes open with a known issue: DSV3.2 will crash with an IMA.... onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/TensorRT-LLM v1.3.0rc18

Captured source

GitHub: github.com/NVIDIA/TensorRT-LLM

NVIDIA/TensorRT-LLM v1.3.0rc18

published Jun 10, 2026 · seen Jun 10 · captured Jun 10 · http 200 · method plain

v1.3.0rc18

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc18

Published: 2026-06-10T00:10:37Z

Prerelease: yes

Release notes:

Known Issues
DSV3.2 will crash with an IMA in various long-running perf tests on GB200/GB300 when the CuteDSL MoE backend is used. Work around this issue by using another MoE backend.
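
The workaround above amounts to selecting a MoE backend other than CuteDSL before launching. A minimal sketch of a pre-launch guard follows. It assumes the backend is chosen through a `moe_config.backend` key in an options mapping and that `CUTEDSL` and `CUTLASS` are valid backend names; these are assumptions to verify against the TensorRT-LLM documentation for your version. The helper name `avoid_cutedsl` is hypothetical, and the release notes do not say which replacement backend to prefer.

```python
import copy
from typing import Any, Dict

# Assumed spellings of the backend names; check them against your version.
AFFECTED_BACKEND = "CUTEDSL"
FALLBACK_BACKEND = "CUTLASS"  # placeholder choice, not a recommendation


def avoid_cutedsl(options: Dict[str, Any],
                  fallback: str = FALLBACK_BACKEND) -> Dict[str, Any]:
    """Return a copy of `options` with the CuteDSL MoE backend swapped out.

    The input mapping is left untouched. Options that do not select the
    affected backend are returned unchanged.
    """
    patched = copy.deepcopy(options)
    moe = patched.get("moe_config")
    if isinstance(moe, dict):
        if str(moe.get("backend", "")).upper() == AFFECTED_BACKEND:
            moe["backend"] = fallback
    return patched


# Hypothetical options mapping, e.g. loaded from a YAML file.
original = {"moe_config": {"backend": "CUTEDSL"}, "max_batch_size": 64}
patched = avoid_cutedsl(original)
print(patched["moe_config"]["backend"])
```

Working on a deep copy keeps the caller's mapping intact, so the original configuration can still be logged or compared after the swap.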

Model Support
Support Nemotron-H NVFP4 checkpoint on Hopper (#14775)
Add Qwen image support (#13449)
Support Step-3.7-Flash model (#14711)
Add Cosmos3-Nano and Cosmos3-Super support (#14824)
Add AFMoE Trinity support (#13148)

API
Add logprobs_simple_format option to return logprobs as a flat list[float] (#13972)
trtllm-serve, trtllm-eval, trtllm-bench: Make CLI flags take precedence over --config / --extra_llm_api_options YAML (#14812)
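
The precedence rule in the second item (CLI flags win over values from the `--config` / `--extra_llm_api_options` YAML) can be illustrated with a small merge function. This is a sketch of the general pattern, not the project's implementation: the function name, the option names, and the convention that `None` means "flag not given" are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def resolve_options(yaml_options: Dict[str, Any],
                    cli_options: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Merge two option sources so that explicitly passed CLI values win.

    A CLI value of None is treated as "flag not given" (an assumption of
    this sketch), so the YAML value is kept in that case.
    """
    merged = dict(yaml_options)  # start from the YAML file's values
    for key, value in cli_options.items():
        if value is not None:  # only flags the user actually passed
            merged[key] = value
    return merged


# Hypothetical option names, chosen only for illustration.
yaml_options = {"max_batch_size": 64, "tp_size": 2}
cli_options = {"max_batch_size": 128, "tp_size": None}

resolved = resolve_options(yaml_options, cli_options)
print(resolved)
```

Here `max_batch_size` takes the CLI value because the flag was passed, while `tp_size` keeps its YAML value because the corresponding flag was omitted.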

Feature
Upgrade NIXL to v1.0.1 and UCX to 1.21 (#14436)
Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL (#14453)
Enable FlashInfer GDN decoding kernel for Qwen3.5 (#13645)
Add per-expert LoRA support with Cutlass backend (#14801)
Reduce OpenAI stream postprocess overhead (#14708)
Add encoder CUDA graph support to llm.encode() (#14326)
Use a Triton kernel for C++ mamba hybrid state update (#14869)
Fuse masked gather + finalize-scale into one Triton kernel in DeepGemmFusedMoE (#14592)
Support KVCacheManagerV2 adjust() in single GPU + agg PyExecutor loop (#14578)
Add disk cache config for KVCacheManagerV2 (#14845)
Add Wan I2V generation example (#14981)
Add LTX-2 visual generation example (#14976)
Update flashinfer-python from 0.6.12rc2 to 0.6.12 (#14805)

Fix
Fix mamba-out-of-block error with ADP + BS=1 + disagg (#14853)
Fix XQA IMA for invalid pages with sliding window (#14459)
Propagate event loop errors to await_responses callers (#12735)
Fix Mamba replay mode accuracy issues (#14509)
Fix PyExecutor hang in disagg TP prefill (#14020)
Fix stale runtime metadata issues during MLA fallback transitions (#14049)
Fix KVCacheManagerV2 block counting correctness issues (#14725)
Canonicalize multimodal cache-key serialization to prevent hash collisions (#14800)
Fix LTX-2 audio PE padding issues (#14818)
Release KVCacheManagerV1 blocks on MAX_UTILIZATION pause (#14723)
Fix config sharing issue for Qwen3-VL (#14766)
Enforce request and buffer index lifecycle integrity (#14768)
Add nemotron-v3 as the proper nemotron-h reasoning parser (#14900)
Clamp KV pool window sizes to max_seq_len (#14905)
Fix mamba block calculation (#14524)
Add trust_remote_code=True to the LLM(...) constructor to fix various model loading issues (#14892)
Fix deep EP partial warp sync for GPT-OSS shapes (#14977)
Add warmup for trtllm-gen fmha JIT kernels (#14851)
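
One fix above canonicalizes multimodal cache-key serialization to prevent hash collisions (#14800). The excerpt does not describe how that change works, so the following is only a sketch of the general technique: serialize the key fields in one unambiguous, order-independent form before hashing. The function names and the field layout are hypothetical.

```python
import hashlib
import json
from typing import Any, List, Mapping


def naive_key(parts: List[str]) -> str:
    """Hash a plain concatenation; field boundaries are lost."""
    return hashlib.sha256("".join(parts).encode("utf-8")).hexdigest()


def canonical_key(fields: Mapping[str, Any]) -> str:
    """Hash a canonical JSON form: sorted keys, fixed separators.

    Sorting makes the key independent of dict insertion order, and the
    JSON structure keeps field boundaries explicit.
    """
    canonical = json.dumps(fields, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Concatenation collides: "ab" + "c" and "a" + "bc" are the same bytes.
collides = naive_key(["ab", "c"]) == naive_key(["a", "bc"])

# The canonical form keeps the two inputs distinct.
distinct = (canonical_key({"parts": ["ab", "c"]})
            != canonical_key({"parts": ["a", "bc"]}))

# Key order does not change the result.
stable = (canonical_key({"modality": "image", "size": 224})
          == canonical_key({"size": 224, "modality": "image"}))

print(collides, distinct, stable)
```

The design point is that two logically different inputs must never serialize to the same bytes, and two logically equal inputs must always serialize to the same bytes.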

Documentation
Add VisualGen API walkthrough example and docs page (#14685)
Add Nemotron 3 Ultra doc (#14964, #15113)

Test & Infra
Pipe stderr separately in subprocess calls to improve error reporting in Allure (#14750)
Remove obsolete tests (#14995, #14660, #14992, #14952, #14749)
Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis (#14528)
Relocate tests to right-sized stages (#14684)
Move non-default-feature tests to post merge (#15038)

What's Changed

[None][test] Update datasets path by @JennyLiu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14671
[None][infra] Update new .test_durations by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14661
[TRTLLM-13015][feat] drop complex visual_gen CLI example scripts by @zhenhuaw-me in https://github.com/NVIDIA/TensorRT-LLM/pull/14632
[https://nvbugs/6117811][fix] Fix XQA IMA for invalid pages with sliding window by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14459
[None][feat] Tune mamba config by env variables by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/14730
[None][test] Update moe backend for ctx and acceptance length env by @fredricz-20070104 in https://github.com/NVIDIA/TensorRT-LLM/pull/14803
[None][test] Update precision of previous device step time by @fredricz-20070104 in https://github.com/NVIDIA/TensorRT-LLM/pull/14809
[None][infra] Waive 12 failed cases for main in post-merge 2749 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14802
[TRTLLM-12971][infra] Fix parse classname logic in timeout result by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/14559
[https://nvbugs/6038228][fix] Propagate event loop errors to await_responses callers by @JunyiXu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/12735
[TRTLLM-12288][feat] Support Nemotron-H nvfp4 ckpt on Hopper by @JadoTu in https://github.com/NVIDIA/TensorRT-LLM/pull/14775
[TRTLLM-12596][feat] Support simple logprob format by @tongyuantongyu in https://github.com/NVIDIA/TensorRT-LLM/pull/13972
[None][fix] Stabilize Mamba replay state update by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/14509
[None][feat] Upgrade NIXL to v1.0.1 and UCX to 1.21 by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/14436
[None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA by @tianyuz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14453
[TRTLLM-10947][perf] eagle3: use cudaMemcpy2DAsync custom op for hidden-state capture by @pcicotti in https://github.com/NVIDIA/TensorRT-LLM/pull/14479
[None][fix] PyExecutor Hang in Disagg TP Prefill by @jthomson04 in https://github.com/NVIDIA/TensorRT-LLM/pull/14020
[https://nvbugs/6240561][fix] Autodeploy fix the deepseek accuracy drop by @nvchenghaoz in https://github.com/NVIDIA/TensorRT-LLM/pull/14774
[#12702][feat] Autodeploy deprecate the legacy triton attention by @nvchenghaoz in https://github.com/NVIDIA/TensorRT-LLM/pull/14194
[None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14789
[None][test] Waive 7 failed cases for main in QA CI by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14791
[https://nvbugs/6240561][fix] Fix AutoDeploy DeepSeek-R1 accuracy drop by @taylor-yb-lee in https://github.com/NVIDIA/TensorRT-LLM/pull/14793
[#14588][fix] [AutoDeploy] Fix OOM of DeepSeek-R1 NVFP4 for tp=4 by @taylor-yb-lee in https://github.com/NVIDIA/TensorRT-LLM/pull/14477
[https://nvbugs/6179761][fix] Save LTX-2 BF16 weights to speed up perf by @yibinl-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/14639
[TRTLLM-13028][doc] Add VisualGen API walkthrough example and docs page by...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine release candidate of an optimization library.