NVIDIA/cudnn-frontend v1.23.0
v1.23.0-release
Repository: NVIDIA/cudnn-frontend
Tag: v1.23.0
Published: 2026-04-29T18:04:53Z
Prerelease: no
Release notes: cuDNN Frontend v1.23.0 is the recommended version for cuDNN 9.21.0 and later releases.
cudnn-frontend now ships pip wheels for Python 3.14t (the free-threaded build).
New APIs 🚀 🚀
Causal Conv1d
- Depthwise causal 1-D convolution with an optional fused SiLU activation (requires cuDNN 9.22.0):
`y = activation(conv1d_causal(x, w) + b)`
Supports forward and backward passes with `torch.autograd` and `torch.compile`. (Not yet supported on Windows.)
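The formula above can be illustrated with a small pure-Python reference. This is only a sketch of the math, not the cuDNN kernel or its Python API: the function names `causal_conv1d_ref` and `silu`, the channels-by-length list layout, and the tap ordering (last tap multiplies the current time step, missing history treated as zero) are all assumptions made for illustration.

```python
import math


def silu(v: float) -> float:
    """SiLU (swish), v * sigmoid(v), written to avoid overflow for large |v|."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def causal_conv1d_ref(x, w, b=None, activation=None):
    """Reference depthwise causal conv1d for a single sequence.

    x: list of C channels, each a list of L floats (assumed layout)
    w: list of C filters, each a list of K taps (one filter per channel)
    b: optional list of C biases
    Output position t reads only x[c][t-K+1 .. t]; earlier history is zero.
    """
    out = []
    for c, (xc, wc) in enumerate(zip(x, w)):
        k = len(wc)
        bias = b[c] if b is not None else 0.0
        yc = []
        for t in range(len(xc)):
            acc = bias
            for j in range(k):
                src = t - (k - 1) + j  # assumed: tap K-1 multiplies x[c][t]
                if src >= 0:
                    acc += wc[j] * xc[src]
            yc.append(activation(acc) if activation else acc)
        out.append(yc)
    return out
```

Because each output position depends only on the current and earlier inputs, changing a later input sample never alters earlier outputs, which is the property that makes the operation usable for autoregressive models.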
Updates to Graph API
Transpose (requires cuDNN 9.22.0)
- Added new `Graph::transpose` with `Transpose_attributes` (permutation, optional compute dtype, name).
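As a rough illustration of what a permutation does to a tensor's shape, the sketch below computes the output dimensions. It assumes the NumPy-style convention in which output axis `i` takes input axis `permutation[i]`; the release notes do not state which convention `Transpose_attributes` uses, and `transposed_dims` is a hypothetical helper, not part of the frontend.

```python
def transposed_dims(dims, permutation):
    """Output dims of a transpose, assuming out[i] = dims[permutation[i]]."""
    if sorted(permutation) != list(range(len(dims))):
        raise ValueError("permutation must list every axis exactly once")
    return [dims[p] for p in permutation]
```

For example, under this assumed convention a `[2, 3, 4]` tensor with permutation `[2, 0, 1]` has output dims `[4, 2, 3]`.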
Slice (requires cuDNN 9.22.0)
- Extend `Slice_attributes` with `set_strides` for per-axis slice steps; strided slices update the inferred output shape and strides accordingly.
- Python: `pygraph.slice` now honors each dimension's `slice.step`.
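The shape and stride arithmetic for a strided slice can be sketched in plain Python. This is the usual strided-view calculation and is offered as an assumption about what "update inferred output shape and strides" amounts to, not as the frontend's implementation; `strided_slice_layout` is a hypothetical helper and only positive steps are handled.

```python
def strided_slice_layout(dims, strides, slices):
    """Return (out_dims, out_strides, element_offset) for per-axis slices.

    dims, strides: input tensor shape and element strides
    slices: one Python slice object per axis
    """
    out_dims, out_strides, offset = [], [], 0
    for dim, stride, sl in zip(dims, strides, slices):
        start, stop, step = sl.indices(dim)  # resolves None and negative bounds
        if step <= 0:
            raise ValueError("this sketch only handles positive steps")
        out_dims.append(len(range(start, stop, step)))  # elements selected
        out_strides.append(stride * step)               # step widens the stride
        offset += start * stride                        # where the view begins
    return out_dims, out_strides, offset
```

Slicing an `[8, 10]` row-major tensor (strides `[10, 1]`) with `slice(0, 8, 2)` and `slice(1, 10, 3)` gives dims `[4, 3]`, strides `[20, 3]`, and a starting offset of 1 element.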
Concatenate (requires cuDNN 9.22.0)
- Extend `Concatenate_attributes` with `set_in_place_index` (optional). When unset, concatenate runs out-of-place per backend rules.
Reshape (requires cuDNN 9.22.0)
- Introduce `ReshapeMode_t` (`VIEW_ONLY`, `LOGICAL`) and `Reshape_attributes::set_reshape_mode` so reshapes can select view-style vs. lexicographic logical reshape.
Compile-time constants (requires cuDNN 9.22.0)
- Added `cudnn.scalar_type` (`RUNTIME_PARAM`, `COMPILE_TIME_CONST`) and `Graph::tensor(scalar, ScalarType)` overloads, so scalars can be execution-time variant-pack inputs or constants embedded in the plan.
- `Tensor_attributes` can be marked as a compile-time constant or a normal runtime pass-by-value scalar.
Open source kernels 🚀 🚀
- [GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_srelu): High-performance implementation of squared-ReLU fused with GEMM.
- [GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_dsrelu): High-performance implementation of dsquared-ReLU fused with GEMM.
- [Grouped GEMM + GLU + Hadamard](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_glu_hadamard): Dense grouped GEMM GLU forward fusion with a fused Hadamard transform and per-expert AMAX reduction.
- [Grouped GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_srelu): Contiguous grouped squared-ReLU GEMM for MoE workloads.
- [Grouped GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_dsrelu): Contiguous and discrete grouped dsquared-ReLU GEMM for MoE workloads.
- [RMSNorm + RHT + amax](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/rmsnorm_rht_amax): A fused CUTE DSL kernel for NVIDIA Blackwell GPUs (SM100+) that applies RMS normalization, a block-diagonal Hadamard transform with fixed block size 16, and a per-CTA `amax` reduction.
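For readers unfamiliar with the activation fused into several of the kernels above, here is a scalar reference for squared ReLU and its backward. It assumes "sReLU" means `max(x, 0)^2` and "dsReLU" means the corresponding gradient; the function names are illustrative and unrelated to the kernel entry points.

```python
def srelu(x: float) -> float:
    """Squared ReLU: max(x, 0) ** 2."""
    r = max(x, 0.0)
    return r * r


def dsrelu(x: float, grad_out: float = 1.0) -> float:
    """Backward of squared ReLU: grad_out * 2 * max(x, 0)."""
    return grad_out * 2.0 * max(x, 0.0)
```

In the fused kernels this elementwise step is applied to the GEMM output (forward) or combined with the incoming gradient (backward), which avoids a separate pass over memory.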
Fix block-scale quantize
- The scale tensor uses a 128x4 reordered layout (`TensorReordering_t::F8_128x4`). When the reordering type is set on the scale tensor, the frontend automatically pads the inferred scale dimensions to align with the 128x4 block structure: non-batch, non-axis dimensions are padded to multiples of 128, and the quantize-axis dimension is padded to multiples of 4.
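The padding rule described above can be written out as a few lines of arithmetic. This is a sketch only: `pad_scale_dims` is a hypothetical helper, and treating axis 0 as the sole batch axis is an assumption, since the notes do not say how the frontend identifies batch dimensions.

```python
def pad_scale_dims(dims, quantize_axis, batch_axes=(0,)):
    """Pad inferred scale dims for the 128x4 reordered layout.

    quantize axis -> next multiple of 4
    batch axes    -> unchanged (assumed to be axis 0 by default)
    other axes    -> next multiple of 128
    """
    def round_up(value, multiple):
        return -(-value // multiple) * multiple  # ceiling to a multiple

    padded = []
    for axis, dim in enumerate(dims):
        if axis == quantize_axis:
            padded.append(round_up(dim, 4))
        elif axis in batch_axes:
            padded.append(dim)
        else:
            padded.append(round_up(dim, 128))
    return padded
```

With these assumptions, scale dims `[2, 300, 10]` quantized along axis 2 pad to `[2, 384, 12]`, while already aligned dims such as `[1, 128, 4]` are left unchanged.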
General Improvements ✨✨
- Grouped GEMM APIs now default to dynamic MNKL compilation across GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set `CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0` to restore the previous M-only dynamic behavior.
- Grouped GEMM wgrad wrapper APIs now support caller-provided output buffers (`wgrad_tensor` for dense, `wgrad_ptrs` for discrete).
- Removed the unused internal `c_tensor` from the Grouped GEMM quant path.
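To opt back into the previous M-only dynamic behavior from Python, the environment variable named above can be set at process start. The notes do not say when the variable is read, so setting it before importing or calling any grouped GEMM wrapper is a conservative assumption rather than a documented requirement.

```python
import os

# Assumption: the variable may be consulted at import or first-call time,
# so set it before touching any grouped GEMM wrapper.
os.environ["CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL"] = "0"

# import cudnn  # placement is illustrative: import after the variable is set
```

Exporting the same variable in the launching shell has the same effect and avoids any ordering concerns inside the script.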
Bug fix 🐛
- Fixed a Grouped GEMM GLU bias compilation issue for 64B-aligned inputs with dynamic MNKL.
- Fixed a dropout issue on Blackwell when cuDNN frontend 1.21 is used with cuDNN backend 9.21 and 9.22.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements. Added Kimi-K2.6, LTX-2, Qwen 2.5, and Wan2.2 to the benchmark results page.
Acknowledgements:
- Thanks @haowen-han for fixing a bug in the block-scale matmul sample.