What does this release signal mean?

NVIDIA published NVIDIA/TensorRT-LLM v1.3.0rc19, a release candidate in the 1.3 series of its TensorRT-LLM inference library, on 2026-06-23T16:49:25Z. The release is marked as a prerelease, and its release notes list one known issue: Llama 3.1 8B FP8 can hang during the autotuner warmup on GB200. This release signal is evidence of what shipped, changed, or was packaged for users. onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/TensorRT-LLM v1.3.0rc19

Captured source

GitHub: github.com/NVIDIA/TensorRT-LLM

NVIDIA/TensorRT-LLM v1.3.0rc19

published Jun 23, 2026 · seen 2d · captured 2d · http 200 · method plain

v1.3.0rc19

Repository: NVIDIA/TensorRT-LLM

Tag: v1.3.0rc19

Published: 2026-06-23T16:49:25Z

Prerelease: yes
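
The tag and the "Prerelease: yes" flag agree with each other: the rc19 suffix marks a release candidate. As a minimal sketch, the same conclusion can be drawn from the tag string alone. The regular expression and the function names parse_tag and is_prerelease are assumptions made for this illustration; they are not part of TensorRT-LLM or GitHub tooling, and the pattern covers only simple tags of this shape, not the full PEP 440 grammar.

```python
import re

# Simplified pattern for tags such as "v1.3.0rc19" (illustrative only;
# this is not a complete PEP 440 version parser).
TAG_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:(a|b|rc)(\d+))?$")


def parse_tag(tag):
    """Return (major, minor, patch, pre_label, pre_number) for a release tag."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    major, minor, patch, label, number = match.groups()
    pre_number = int(number) if number is not None else None
    return int(major), int(minor), int(patch), label, pre_number


def is_prerelease(tag):
    """True when the tag carries an alpha, beta, or release-candidate suffix."""
    return parse_tag(tag)[3] is not None


print(parse_tag("v1.3.0rc19"))
print(is_prerelease("v1.3.0rc19"))
```

A tag without a suffix, such as the hypothetical "v1.3.0", would be reported as a final release by this check.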

Release notes:

Known Issues
Llama 3.1 8B FP8 can hang during the autotuner warmup on GB200.

Model Support
Support NVIDIA Wan2.2-T2V quantized checkpoints (#15093)
Enable MTP for Step-3.7 NVFP4 and port Step-3.7VL vision tower to TRT-LLM modules (#14926)
Support T5 and BART in the PyTorch backend (#13919)
Support MiniMax-M3 in the PyTorch backend (#15292)

API
Align VisualGen serve request schema with VisualGenParams (#14733)
Support multi-item scoring in LLM.encode (#14693)
Drop legacy --extra_visual_gen_options CLI alias (#15262)

Feature
Enable TRTLLM MoE backend for Nemotron-H BF16 checkpoint (#14944)
Add async Ulysses pipeline (enabled for LTX-2 and WAN) (#13978)
Make TrtllmGenAttention the default decode backend on Blackwell+ (#14618)
Skip redundant data expand in DeepGemmFusedMoE via fused expand+quant Triton kernel (#14591)
Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy (#12636)
Add Indexer TopK single-block / multi-pass radix implementation (#14268)
Enable gen-only speculative decoding for disagg setups (#14546)
Support EAGLE3 dynamic trees on Blackwell (#12958)
Add CUDA graph support for per-expert LoRA in Cutlass backend (#14881)
Add support for beam search in disaggregated serving (#14876)
Add maximal LLMAPI capture in usage telemetry (#14398)
Optimize Qwen2.5/3/3.5-VL performance (#11943)
Add skip-softmax TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 (#15163)
Enable TRTLLM cross attention backend (#15345)
Support per-request mm_processor_kwargs for Qwen3-VL (#14702)
Add prefetch_reuse_blocks and configurable prefetch count (#15149)
Add MegaMoECuteDsl NVFP4 MoE backend (#14608)
Make EAGLE3 honor sampling params by default (#14745)
Add multiple FMHA library support to TRTLLM attention backend (#15204)
Add checkpointing variant of replay for MTP for mamba models (#14203)

Fix
Remove redundant TikTokenTokenizer shim from Kimi-K2.5 input processor (#14741)
Rename misnamed tunable_fp4_quantize kwarg and add real SF-swizzle control (#15002)
Gate FlashInfer GDN kernels to supported configurations (#15094)
Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate (#15088)
Select CUTLASS MoE backend on non-Blackwell SMs for Qwen3.5-35B-A3B FP8 (#15081)
Fix SageAttention kernel regression by using static scheduler (#15047)
Fall back to local cache when loading tokenizer for gated models (#12998)
Fix PyExecutor FPM iteration timing (#14922)
Register multimodal placeholders for Qwen3.5 MoE VLM serving (#15079)
Fix and unwaive Nemotron-related bugs (#15085)
Guard DSA DSL atom-split against MTP draft next (#14891)
Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD (#15136)
Clear workspace in run_mla_generation to avoid illegal memory access (#15173)
Fix MAX_UTILIZATION reuse token budget (#15066)
Add kv_transfer_timeout_ms to avoid timeout (#15152)
Preserve ip:port for trtllm-serve visual-gen (#14355)
Fix guided decoding (xgrammar) + EAGLE-3 + draft_len_schedule crash during CUDA graph capture (#15023)
Stabilize Mamba replay state update (#14841)
Fix max_context_length value for attention workspace sizing (#15156)
Fix issue where host KV cache usage would double when speculative decoding is used (#14373)
Disable NCCL_SYMMETRIC tactic on GB10 (DGX Spark) (#12902)
Fix attentionOp FP8 MLA KV-reuse workspace calculation (#14852)
Fix beam search log_probs non-determinism with batch_size > 1 (#15125)
Forward secondary_offload_min_priority to KVCacheManager in PyTorch executor (#13768)
Enable multi-block mode for XQA HMMA spec-dec (#15312)
Fix TinyGEMM barrier bug (#15338)
Fix stale sparse attention kwargs (#15460)
Fix CppMambaHybridCacheManager to handle dp dummy request (#15054)
Fix embedding vocab mask for rejection sampling in Kimi-K2.5 (#15233)

Documentation
Add FLUX visual generation examples (#14987)
Add Qwen3.5 deployment guide doc (#15111)
Fix stale --disable_xqa reference in legacy docs (#13395)
Add Cache-DiT documentation (#15268)

Benchmark
Weight trtllm-bench AR/AL averages by output length (#14998)
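
The Benchmark item above changes how trtllm-bench averages AR/AL. The release notes do not expand the abbreviations; the sketch below assumes they are per-request speculative-decoding acceptance metrics and shows only the general idea of weighting an average by output length. The function name weighted_mean and all numbers are hypothetical, and this is not the trtllm-bench implementation.

```python
def weighted_mean(values, weights):
    """Average of values where each entry counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("total weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Made-up per-request numbers, purely for illustration.
acceptance_lengths = [2.0, 4.0]  # hypothetical per-request metric values
output_lengths = [100, 900]      # generated tokens per request

plain = sum(acceptance_lengths) / len(acceptance_lengths)
weighted = weighted_mean(acceptance_lengths, output_lengths)
print(plain, weighted)
```

With these made-up inputs the plain mean is 3.0 while the length-weighted mean is 3.8, because the request that generated 900 tokens dominates. A short request can no longer pull the average as far as a long one.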

Test & Infra
Add accuracy tests for nemotron-v3-ultra (#14808)
Remove TestLlama4ScoutInstruct tests (#15144)
Require minimum of 4 GPUs in llm_perf_core.yml and add new performance tests (#15090)
Add DFlash coverage for Qwen3.5 MoE variant (#15132)
Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 (#15126)
Enable disagg cancellation stress test (#15174)
Fix periodic-junit in unittest pytest (#14075)
Update K2.5 and GLM-5 into CI perf test (#14960)
Add Qwen3-32B FP8 disagg stress test (#14278)
Sunset old disagg test cases for the QA side (#15290)
Add e2e Tensor Parallel LPIPS tests for VisualGen (#15208)
Remove TensorRT performance baseline and update to PyTorch only (#15256)
Add integration tests for MoE LoRA and bugfixes (#15271)

What's Changed

[None][infra] Waive TestQwen3NextInstruct nvfp4 cases by @mzweilz in https://github.com/NVIDIA/TensorRT-LLM/pull/15086
[https://nvbugs/6248757][fix] Avoid running all reduce in aux stream by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14917
[https://nvbugs/6221483][fix] AutoDeploy: Fix Eagle metadata host syncs by @govind-ramnarayan in https://github.com/NVIDIA/TensorRT-LLM/pull/14714
[None][feat] add FLUX visual generation examples by @karljang in https://github.com/NVIDIA/TensorRT-LLM/pull/14987
[https://nvbugs/6261164][fix] AutoDeploy: Don't allocate speculative caches when speculation is off by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/15020
[https://nvbugs/6211189][fix] Lower the reference to 46.5 (matching cross-GPU empirical mean) and remove the t by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14799
[None][refactor] split VisualGen pipeline and model configs by @bobboli in https://github.com/NVIDIA/TensorRT-LLM/pull/14956
[TRTLLM-11457][feat] Async Ulysses pipeline (Enabled for LTX-2 + WAN) by @luyiyun1021 in https://github.com/NVIDIA/TensorRT-LLM/pull/13978
[TRTLLM-11548][doc] Add Qwen3.5 deployment guide doc by @nv-guomingz in https://github.com/NVIDIA/TensorRT-LLM/pull/15111
[https://nvbugs/6181383][fix] Build...
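
The "What's Changed" entries above share a visible shape: a bracketed ticket reference, a bracketed change type, a title, an author handle, and a pull request URL. A minimal sketch of splitting one entry into those fields follows. The pattern is inferred only from the lines shown in this excerpt, and the names ENTRY_RE and parse_entry are hypothetical; entries that are truncated, like the last one, do not match and return None.

```python
import re

# Pattern inferred from the entries shown in this excerpt (an assumption,
# not an official format specification).
ENTRY_RE = re.compile(
    r"^\[(?P<ticket>[^\]]*)\]\[(?P<kind>[^\]]*)\]\s+(?P<title>.*?)"
    r"\s+by\s+@(?P<author>[\w-]+)"
    r"\s+in\s+(?P<url>https://\S+/pull/(?P<pr>\d+))$"
)


def parse_entry(line):
    """Return a dict of fields for one changelog line, or None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["pr"] = int(fields["pr"])
    return fields


entry = parse_entry(
    "[None][feat] add FLUX visual generation examples by @karljang in "
    "https://github.com/NVIDIA/TensorRT-LLM/pull/14987"
)
print(entry["kind"], entry["author"], entry["pr"])

# A truncated entry cannot be parsed and is reported as None.
print(parse_entry("[https://nvbugs/6181383][fix] Build..."))
```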

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine release candidate for an optimization library.