What does this release signal mean?

NVIDIA published v1.23.0 of NVIDIA/cudnn-frontend, its frontend API for cuDNN deep learning operations. This release signal is evidence of what shipped, changed, or was packaged for users. High-signal details: repository NVIDIA/cudnn-frontend; tag v1.23.0 (v1.23.0-release); published 2026-04-29T18:04:53Z; prerelease: no; release notes: "cuDNN Frontend v1.23.0 is the recommended....". onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/cudnn-frontend v1.23.0

Captured source

GitHub · github.com/NVIDIA/cudnn-frontend

NVIDIA/cudnn-frontend v1.23.0

published Apr 29, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain

v1.23.0-release

Repository: NVIDIA/cudnn-frontend

Tag: v1.23.0

Published: 2026-04-29T18:04:53Z

Prerelease: no

Release notes: cuDNN Frontend v1.23.0 is the recommended version for cuDNN 9.21.0 and later releases.

cudnn-frontend now ships pip wheels for Python 3.14t.

New APIs 🚀 🚀

Causal Conv1d

Depthwise causal 1-D convolution with an optional fused SiLU activation (requires cuDNN 9.22.0): y = activation(conv1d_causal(x, w) + b). Supports forward and backward passes with torch.autograd and torch.compile. Not yet supported on Windows.
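The formula above can be read against a plain NumPy reference. This is only a sketch of the math, not the cuDNN Frontend API: the function names, the (batch, channels, length) layout, the (channels, kernel) weight layout, and the cross-correlation tap order (the last tap multiplies the current time step) are all assumptions made for illustration.

```python
import numpy as np

def silu(z):
    # SiLU(z) = z * sigmoid(z), written as z / (1 + exp(-z)).
    return z / (1.0 + np.exp(-z))

def causal_conv1d_ref(x, w, b=None, activation=None):
    """Reference for y = activation(conv1d_causal(x, w) + b).

    Assumed layouts (hypothetical, for illustration only):
      x: (batch, channels, length), w: (channels, kernel), b: (channels,)
    """
    batch, channels, length = x.shape
    kernel = w.shape[1]
    # Left-pad with kernel-1 zeros so y[t] only sees x[t-kernel+1 .. t].
    xp = np.pad(x, ((0, 0), (0, 0), (kernel - 1, 0)))
    y = np.zeros_like(x, dtype=np.float64)
    for k in range(kernel):
        # Depthwise: each channel uses its own filter, no channel mixing.
        y += xp[:, :, k:k + length] * w[None, :, k, None]
    if b is not None:
        y += b[None, :, None]
    return activation(y) if activation is not None else y
```

Calling `causal_conv1d_ref(x, w, b, activation=silu)` gives the fused-activation form. Because of the left padding, changing a later input sample never changes an earlier output sample, which is the property "causal" refers to.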

Updates to Graph API

Transpose (requires cuDNN 9.22.0)

Added Graph::transpose, configured through Transpose_attributes (permutation, optional compute dtype, name).
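As a rough illustration of what a permutation does to tensor metadata, the sketch below reorders dimensions and strides in plain Python. The helper name is hypothetical, and the convention that output axis i takes input axis permutation[i] (the NumPy convention) is an assumption; the release notes do not state which convention Graph::transpose uses or whether it materializes data.

```python
def transpose_dims(dims, strides, permutation):
    """Reorder dims and strides; output axis i comes from input axis permutation[i]."""
    if sorted(permutation) != list(range(len(dims))):
        raise ValueError("permutation must list every axis exactly once")
    out_dims = [dims[p] for p in permutation]
    out_strides = [strides[p] for p in permutation]
    return out_dims, out_strides

# A packed (2, 3, 4) tensor with its last two axes swapped.
print(transpose_dims([2, 3, 4], [12, 4, 1], [0, 2, 1]))
```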

Slice (requires cuDNN 9.22.0)

Extend Slice_attributes with set_strides for per-axis slice steps; strided slices update inferred output shape and strides accordingly.
Python: pygraph.slice now honors each dimension's slice.step
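To make the shape and stride inference concrete, here is a small sketch using Python's built-in `slice` objects. It is not the frontend's implementation: the helper name is hypothetical, and the rule that each output stride is the input stride multiplied by the step assumes the slice is expressed as a view over the same memory.

```python
def strided_slice_meta(dims, strides, slices):
    """Infer output dims, strides, and element offset for a per-axis strided slice."""
    out_dims, out_strides, offset = [], [], 0
    for dim, stride, s in zip(dims, strides, slices):
        start, stop, step = s.indices(dim)  # resolves None and clamps to dim
        if step <= 0:
            raise ValueError("this sketch only handles positive steps")
        out_dims.append(len(range(start, stop, step)))  # ceil((stop-start)/step)
        out_strides.append(stride * step)  # stepping skips step elements per index
        offset += start * stride
    return out_dims, out_strides, offset

# Every 2nd row and every 3rd column (starting at column 1) of a packed 8x10 tensor.
print(strided_slice_meta([8, 10], [10, 1], [slice(0, 8, 2), slice(1, 10, 3)]))
```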

Concatenate (requires cuDNN 9.22.0)

Extend Concatenate_attributes with set_in_place_index (optional). When unset, concatenate runs out-of-place per backend rules.

Reshape (requires cuDNN 9.22.0)

Introduce ReshapeMode_t (VIEW_ONLY, LOGICAL) and Reshape_attributes::set_reshape_mode so that a reshape can select either a view-style reshape or a lexicographic logical reshape.

Compile-time constants (requires cuDNN 9.22.0)

Added cudnn.scalar_type (RUNTIME_PARAM, COMPILE_TIME_CONST) and Graph::tensor(scalar, ScalarType) overloads, so scalars can be either execution-time variant-pack inputs or constants embedded in the plan.
Tensor_attributes can be marked as a compile-time constant or as a normal runtime pass-by-value scalar.

Open source kernels 🚀 🚀

[GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_srelu): High-performance implementation of squared-ReLU fused with GEMM.
[GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/gemm_dsrelu): High-performance implementation of dsquared-ReLU fused with GEMM.
[Grouped GEMM + GLU + Hadamard](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_glu_hadamard): Dense grouped GEMM GLU forward fusion with a fused Hadamard transform and per-expert AMAX reduction.
[Grouped GEMM + sReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_srelu): Contiguous grouped squared-ReLU GEMM for MoE workloads.
[Grouped GEMM + dsReLU](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/grouped_gemm/grouped_gemm_dsrelu): Contiguous and discrete grouped dsquared-ReLU GEMM for MoE workloads.
[RMSNorm + RHT + amax](https://github.com/NVIDIA/cudnn-frontend/tree/main/python/cudnn/rmsnorm_rht_amax): A fused CUTE DSL kernel for NVIDIA Blackwell GPUs (SM100+) that applies RMS normalization, a block-diagonal Hadamard transform with fixed block size 16, and a per-CTA amax reduction.
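For orientation, the squared-ReLU fusions in this list can be compared against a NumPy reference. This sketch assumes "squared-ReLU" means max(0, z)^2 applied to the GEMM output and reads "dsquared-ReLU" as its derivative, 2 * max(0, z), applied to an incoming gradient; both function names are hypothetical and neither reflects the actual kernel interfaces.

```python
import numpy as np

def gemm_srelu_ref(a, b):
    """Reference for a GEMM followed by squared-ReLU: max(0, a @ b) ** 2."""
    z = a @ b
    return np.maximum(z, 0.0) ** 2

def srelu_grad_ref(z, dy):
    """Gradient of squared-ReLU w.r.t. its input z: dy * 2 * max(0, z)."""
    return dy * 2.0 * np.maximum(z, 0.0)

a = np.array([[1.0, 2.0], [-1.0, 0.0]])
b = np.array([[1.0], [1.0]])
print(gemm_srelu_ref(a, b))  # positive pre-activations are squared, negatives become 0
```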

Fix block-scale quantize: the scale tensor uses a 128x4 reordered layout (TensorReordering_t::F8_128x4). When the reordering type is set on the scale tensor, the frontend automatically pads the inferred scale dimensions to align with the 128x4 block structure: non-batch, non-axis dimensions are padded to multiples of 128, and the quantize-axis dimension is padded to multiples of 4.
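The padding rule can be sketched in a few lines of Python. The helper names and the `batch_axes` parameter are hypothetical; in particular, which axes count as batch axes is an assumption here (defaulting to axis 0), since the release notes do not spell it out.

```python
def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple

def pad_scale_dims(scale_dims, quantize_axis, batch_axes=(0,)):
    """Pad inferred scale dims for a 128x4 reordered layout (illustrative only)."""
    padded = []
    for axis, dim in enumerate(scale_dims):
        if axis in batch_axes:
            padded.append(dim)                 # batch dims are left as-is
        elif axis == quantize_axis:
            padded.append(round_up(dim, 4))    # quantize axis: multiple of 4
        else:
            padded.append(round_up(dim, 128))  # other dims: multiple of 128
    return padded

print(pad_scale_dims([2, 200, 6], quantize_axis=2))
```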

General Improvements ✨✨

Grouped GEMM APIs now default to dynamic MNKL compilation across GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0 to restore the previous M-only dynamic behavior.
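One way to opt back into the previous behavior from Python is to set the variable in the process environment before the grouped GEMM wrappers are used. This is a sketch under an assumption: the release notes do not say when the variable is read, so setting it before importing the package is simply the conservative choice.

```python
import os

# Restore the previous M-only dynamic behavior for grouped GEMM wrappers.
# Assumption: the variable is read at or after import time, so set it first.
os.environ["CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL"] = "0"

# import cudnn  # import the frontend only after the variable is set
```

Exporting the variable in the launching shell has the same effect and avoids any dependence on import order.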

Grouped GEMM wgrad wrapper APIs now support caller-provided output buffers (wgrad_tensor for dense, wgrad_ptrs for discrete).

Unused internal c_tensor removed from Grouped GEMM quant path

Bug fix 🐛

Fixed a Grouped GEMM GLU bias compilation issue for 64B-aligned inputs with dynamic MNKL.

Fixed a dropout issue on Blackwell that occurs when cuDNN frontend 1.21 is used with cuDNN backend 9.21 and 9.22.

Benchmarking 📊

Updated the benchmark results for the SDPA improvements. Added Kimi-K2.6, LTX-2, Qwen 2.5, and Wan2.2 to the benchmark results page.

Acknowledgements:

Thanks @haowen-han for fixing a bug in the block-scale matmul sample.

Notability

notability 4.0/10

Routine library version bump.