Release · NVIDIA · published Apr 29, 2026

NVIDIA/cudnn-frontend v1.23.0


v1.23.0-release

Repository: NVIDIA/cudnn-frontend

Tag: v1.23.0

Published: 2026-04-29T18:04:53Z

Prerelease: no

Release notes: cuDNN Frontend v1.23.0 is the recommended version for cuDNN 9.21.0 and later releases.

cudnn-frontend now provides pip wheels for Python 3.14t.

New APIs 🚀 🚀

Causal Conv1d

  • Depthwise causal 1-D convolution with optional fused SiLU activation (requires cuDNN 9.22.0): y = activation(conv1d_causal(x, w) + b). Supports forward and backward passes with torch.autograd and torch.compile. Not yet supported on Windows.
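The formula above can be illustrated with a plain-Python reference, which is only a sketch of the math and not the cuDNN kernel or its API. The function names, the nested-list layout `[channels][length]`, and the tap ordering (the last weight multiplies the current sample) are assumptions made for this example.

```python
import math


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def causal_conv1d_depthwise(x, w, b=None, activation=None):
    """Reference for y = activation(conv1d_causal(x, w) + b).

    x: [channels][length], w: [channels][kernel], b: [channels] or None.
    Depthwise: channel c of x is filtered only by w[c].
    Causal: output at time t reads inputs t-(k-1) .. t, never the future.
    Tap ordering (w[c][-1] hits the current sample) is an assumption here.
    """
    out = []
    for c, (xc, wc) in enumerate(zip(x, w)):
        k = len(wc)
        bias = b[c] if b is not None else 0.0
        row = []
        for t in range(len(xc)):
            acc = bias
            for j in range(k):
                src = t - (k - 1) + j  # implicit left zero padding when src < 0
                if src >= 0:
                    acc += wc[j] * xc[src]
            row.append(activation(acc) if activation else acc)
        out.append(row)
    return out


y = causal_conv1d_depthwise([[1.0, 2.0, 3.0, 4.0]], [[1.0, 1.0]])
y_act = causal_conv1d_depthwise([[1.0, 2.0, 3.0, 4.0]], [[1.0, 1.0]], activation=silu)
```

Because each output only reads current and earlier inputs, changing the last input sample leaves every earlier output unchanged, which is the property that makes the operation usable in autoregressive models.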

Updates to Graph API

Transpose (requires cuDNN 9.22.0)

  • Added new Graph::transpose with Transpose_attributes(permutation, optional compute dtype, name)
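As a rough illustration of what a permutation does to a tensor's metadata, the sketch below reorders dimensions and strides the way NumPy and PyTorch `permute` do (output axis i takes input axis `permutation[i]`). Whether Graph::transpose uses the same convention is an assumption; `permuted_view` is a hypothetical helper, not part of cudnn-frontend.

```python
def permuted_view(dims, strides, permutation):
    """Return (dims, strides) of a transposed view without moving data.

    Convention assumed here: output axis i corresponds to input axis
    permutation[i], as in NumPy/PyTorch permute.
    """
    if sorted(permutation) != list(range(len(dims))):
        raise ValueError("permutation must contain each axis exactly once")
    new_dims = [dims[p] for p in permutation]
    new_strides = [strides[p] for p in permutation]
    return new_dims, new_strides


# A packed [2, 3, 4] tensor with its last two axes swapped.
t_dims, t_strides = permuted_view([2, 3, 4], [12, 4, 1], [0, 2, 1])
```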

Slice (requires cuDNN 9.22.0)

  • Extended Slice_attributes with set_strides for per-axis slice steps; strided slices update the inferred output shape and strides accordingly.
  • Python: pygraph.slice now honors each dimension's slice.step.
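To make the shape and stride inference concrete, here is a small sketch that uses Python's own slice semantics: a step of s keeps every s-th element, so the output extent is the number of selected indices and the output stride is the input stride multiplied by s. That the frontend follows exactly these rules is an assumption, and `strided_slice_layout` is a hypothetical helper rather than a cudnn-frontend function.

```python
def strided_slice_layout(dims, strides, slices):
    """Infer (dims, strides) of a strided slice over a strided tensor.

    dims/strides: per-axis extents and element strides of the input.
    slices: one Python slice per axis; only positive steps are handled.
    """
    out_dims, out_strides = [], []
    for dim, stride, s in zip(dims, strides, slices):
        start, stop, step = s.indices(dim)  # clamps to [0, dim], fills defaults
        if step <= 0:
            raise ValueError("only positive steps are handled in this sketch")
        out_dims.append(len(range(start, stop, step)))
        out_strides.append(stride * step)
    return out_dims, out_strides


# Input [4, 10] packed row-major; take rows 0 and 2, columns 1, 4, 7.
s_dims, s_strides = strided_slice_layout(
    [4, 10], [10, 1], [slice(0, 4, 2), slice(1, 10, 3)]
)
```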

Concatenate (requires cuDNN 9.22.0)

  • Extended Concatenate_attributes with an optional set_in_place_index. When unset, concatenate runs out-of-place per backend rules.

Reshape (requires cuDNN 9.22.0)

  • Introduced ReshapeMode_t (VIEW_ONLY, LOGICAL) and Reshape_attributes::set_reshape_mode so reshapes can select view-style or lexicographic logical reshape.

Compile-time constants (requires cuDNN 9.22.0)

  • Added cudnn.scalar_type (RUNTIME_PARAM, COMPILE_TIME_CONST) and Graph::tensor(scalar, ScalarType) overloads, so scalars can be execution-time variant-pack inputs or constants embedded in the plan.
  • Tensor_attributes can be marked as a compile-time constant or a normal runtime pass-by-value scalar.

Open source kernels 🚀 🚀

  • [GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_srelu): High-performance implementation of squared-ReLU fused with GEMM.
  • [GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_dsrelu): High-performance implementation of dsquared-ReLU fused with GEMM.
  • [Grouped GEMM + GLU + Hadamard](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_glu_hadamard): Dense grouped GEMM GLU forward fusion with a fused Hadamard transform and per-expert AMAX reduction.
  • [Grouped GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_srelu): Contiguous grouped squared-ReLU GEMM for MoE workloads.
  • [Grouped GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_dsrelu): Contiguous and discrete grouped dsquared-ReLU GEMM for MoE workloads.
  • [RMSNorm + RHT + amax](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/rmsnorm_rht_amax): A fused CUTE DSL kernel for NVIDIA Blackwell GPUs (SM100+) that applies RMS normalization, a block-diagonal Hadamard transform with fixed block size 16, and a per-CTA amax reduction.
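For readers unfamiliar with the activation these kernels fuse, squared ReLU is max(x, 0) squared, and its derivative is 2 * max(x, 0). The sketch below shows the math as an unfused epilogue on a tiny matrix product; it is not how the linked kernels are implemented, and reading "dsReLU" as the derivative (backward) form is an assumption. All function names are hypothetical.

```python
def srelu(v: float) -> float:
    """Squared ReLU: max(v, 0) ** 2."""
    r = max(v, 0.0)
    return r * r


def dsrelu(v: float) -> float:
    """Derivative of squared ReLU with respect to its input: 2 * max(v, 0)."""
    return 2.0 * max(v, 0.0)


def gemm_srelu(a, b):
    """Unfused reference: C = srelu(A @ B) for row-major nested lists."""
    inner, cols = len(b), len(b[0])
    return [
        [srelu(sum(row[k] * b[k][j] for k in range(inner))) for j in range(cols)]
        for row in a
    ]


c = gemm_srelu([[1.0, -2.0]], [[3.0, 1.0], [1.0, 3.0]])
```

A fused kernel applies the activation while the accumulator tile is still on chip, which avoids writing and re-reading the GEMM output.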

Fix block-scale quantize

  • The scale tensor uses a 128x4 reordered layout (TensorReordering_t::F8_128x4). When the reordering type is set on the scale tensor, the frontend automatically pads the inferred scale dimensions to align with the 128x4 block structure: non-batch, non-axis dimensions are padded to multiples of 128, and the quantize-axis dimension is padded to multiples of 4.
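The padding rule can be worked through with a short sketch. `padded_scale_dims` is a hypothetical helper written for this example, and passing the batch axes explicitly is an assumption; the frontend performs this inference internally.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def padded_scale_dims(scale_dims, quantize_axis, batch_axes=(0,)):
    """Apply the 128x4 padding rule to inferred scale-tensor dimensions.

    - batch axes are left unchanged
    - the quantize axis is padded to a multiple of 4
    - every other axis is padded to a multiple of 128
    """
    padded = []
    for axis, dim in enumerate(scale_dims):
        if axis in batch_axes:
            padded.append(dim)
        elif axis == quantize_axis:
            padded.append(round_up(dim, 4))
        else:
            padded.append(round_up(dim, 128))
    return padded


# Batch of 2, 200 rows, 7 scale blocks along the quantize axis.
p_dims = padded_scale_dims([2, 200, 7], quantize_axis=2)
```

In this example 200 rounds up to 256 and 7 rounds up to 8, so the scale buffer must be allocated for [2, 256, 8] elements rather than the unpadded [2, 200, 7].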

General Improvements ✨✨

  • Grouped GEMM APIs now default to dynamic MNKL compilation across GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0 to restore the previous M-only dynamic behavior.
  • Grouped GEMM wgrad wrapper APIs now support caller-provided output buffers (wgrad_tensor for dense, wgrad_ptrs for discrete)
  • Unused internal c_tensor removed from Grouped GEMM quant path
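Opting out of dynamic MNKL compilation from Python could look like the sketch below. The variable name and value come from the note above; setting it before the grouped GEMM wrappers are imported or compiled is an assumption, since the release notes do not say when the variable is read.

```python
import os

# Restore the previous M-only dynamic behavior for grouped GEMM wrappers.
# Assumption: this must happen before the wrappers are imported or compiled.
os.environ["CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL"] = "0"

# Subsequent imports of the cudnn Python package would follow from here.
```

Exporting the variable in the launching shell has the same effect and avoids any dependence on import order.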

Bug fix 🐛

  • Fixed a Grouped GEMM GLU bias compilation issue for 64B-aligned inputs with dynamic MNKL.
  • Fixed an issue with dropout on Blackwell when cuDNN frontend 1.21 is used with cuDNN backend 9.21 or 9.22.

Benchmarking 📊

  • Updated the benchmark results for the SDPA improvements. Added Kimi-K2.6, LTX-2, Qwen 2.5, and Wan2.2 to the benchmark results page.

Acknowledgements:

  • Thanks @haowen-han for fixing a bug in the block-scale matmul sample.
