NVIDIA/cudnn-frontend v1.24.1
Release: NVIDIA/cudnn-frontend v1.24.1
- Repository: NVIDIA/cudnn-frontend | cuDNN Frontend is NVIDIA's modern, open-source entry point to the cuDNN library and a growing collection of high-performance open-source kernels. | 844 stars | Python
- Name: v1.24.1 release
- Author: @Anerudhan
- Created: 2026-06-08T19:03:06Z
- Published: 2026-06-08T19:04:16Z
cuDNN Frontend v1.24.1 Release Notes
cuDNN Frontend v1.24.1 is the recommended version for cuDNN 9.23.0 and later releases.
General Improvements 🚀 🚀
Updates to Graph API
- Rotary Position Embedding (RoPE) is now available as a cudnn operation, usable both standalone and as a preprocessing stage for the SDPA engine. See the [sample](test/python/test_oss_rope.py) for usage. RoPE fusion with SDPA requires cuDNN 9.24.0.
- SDPA backward now supports hidden dimension `d=256`. Requires cuDNN 9.23.0 or later.
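
As a reference for what a RoPE stage computes, here is a minimal pure-Python sketch of the rotation applied to a single head vector. It is not the cuDNN Frontend API; the function names, the interleaved pairing of elements, and the base of 10000 are assumptions chosen for illustration, and the linked sample remains the authoritative usage reference.

```python
import math


def rope_rotate(x, position, base=10000.0):
    """Rotate one head vector by its position (hypothetical reference helper).

    Assumption: interleaved pairing, i.e. (x[0], x[1]), (x[2], x[3]), ...
    Other implementations pair element i with element i + d/2 instead.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        # Each pair gets its own frequency; lower pairs rotate faster.
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]

# The attention logit depends only on the relative offset between positions.
logit_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
logit_b = dot(rope_rotate(q, 2), rope_rotate(k, 0))
```

Because each pair is rotated by an angle proportional to the position, the query-key dot product depends only on the position difference, which is why the rotation can be applied as a preprocessing step on Q and K before attention.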
Open-Source Kernels 🚀 🚀
- Introduced a DSA module featuring the following DSA/CSA kernels for DsV4:
  - Indexer Forward: CuTe-DSL score kernel (Q @ Kᵗ, ReLU, head reduce, ratio causal mask). Non-fused; pair with Indexer Top-K for the top-K stage.
  - Indexer Top-K: SM100 CuTe-DSL radix top-K kernel with per-row `seq_lens`.
  - Sparse Attention Backward: DSA backward (FlashMLA-shape, SM90/SM100).
  - Sparse Indexer / Attention Score Recompute: Sparse (top-K) recomputation of indexer and attention scores for training loss.
  - Dense Indexer / Attention Score Recompute: Dense (full-KV) analogues of the above.
  - Indexer Backward: Three-stage pipeline (score-grad, three GEMMs, dtype cast) for sparse top-K score tensors.
  - Dense Indexer Backward: Full-KV counterpart of Indexer Backward.
- Grouped GEMM GLU forward kernel with fused Hadamard transform.
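
To make the two-stage indexer flow concrete, the following pure-Python sketch mirrors the listed steps (Q @ Kᵗ, ReLU, head reduce, mask, then top-K with per-row `seq_lens`). It is a reference for the semantics only, not the CuTe-DSL kernels: the function names and tensor layouts are hypothetical, the head reduce is assumed to be an unweighted sum, a plain causal mask stands in for the ratio causal mask (which these notes do not define), and a sort replaces the radix selection.

```python
NEG_INF = float("-inf")


def indexer_scores(q, k):
    """Reference indexer scores (hypothetical helper).

    q: [T][H][D] per-head queries, k: [S][D] keys shared by all heads.
    Returns [T][S] with sum over heads of ReLU(q[t][h] . k[s]),
    and -inf wherever key s lies in the future of query t.
    """
    scores = []
    for t, q_heads in enumerate(q):
        row = []
        for s, key in enumerate(k):
            if s > t:  # plain causal mask (assumption, see lead-in)
                row.append(NEG_INF)
                continue
            total = 0.0
            for q_vec in q_heads:
                logit = sum(a * b for a, b in zip(q_vec, key))
                total += max(logit, 0.0)  # ReLU, then reduce over heads
            row.append(total)
        scores.append(row)
    return scores


def topk_per_row(scores, seq_lens, top_k):
    """Pick up to top_k key indices per row from the first seq_lens[row] keys.

    Ties are broken by the lower index; masked (-inf) entries are skipped.
    """
    selected = []
    for row, n in zip(scores, seq_lens):
        valid = [j for j in range(min(n, len(row))) if row[j] != NEG_INF]
        valid.sort(key=lambda j: (-row[j], j))
        selected.append(valid[:top_k])
    return selected


q = [
    [[1, 0], [0, 1]],
    [[1, 1], [-1, 0]],
    [[0, 2], [1, -1]],
]
k = [[1, 0], [0, 1], [2, 2]]

scores = indexer_scores(q, k)
indices = topk_per_row(scores, seq_lens=[1, 2, 3], top_k=2)
```

Keeping the score and selection stages separate matches the "non-fused" description above: the score tensor is materialized once and the top-K stage only needs each row's valid length.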
Skills
- Added a new Claude skill for converting CuTe-DSL kernels into experimental cuDNN APIs.
Enhancements
- Noisy logging messages are now emitted only once per process.
- Convolution problems are now rejected when total filter size exceeds `INT32_MAX`.
- Support for ragged input order has been added for grouped GEMM weight gradients.
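
The filter-size rejection can be pictured with a small check like the one below. This is a sketch under assumptions, not the library's validation code: the helper names are hypothetical, and "total filter size" is taken to mean the number of filter elements (the product of the filter dimensions) rather than a byte count.

```python
import math

INT32_MAX = 2**31 - 1


def filter_element_count(filter_dims):
    """Number of elements in a filter tensor, e.g. dims = [K, C, R, S]."""
    if any(d <= 0 for d in filter_dims):
        raise ValueError("filter dimensions must be positive")
    return math.prod(filter_dims)


def filter_size_supported(filter_dims):
    """True when the element count still fits in a signed 32-bit integer."""
    return filter_element_count(filter_dims) <= INT32_MAX


small = filter_size_supported([64, 64, 3, 3])        # 36,864 elements
too_big = filter_size_supported([65536, 32768, 1, 1])  # 2**31 elements
```

Callers that build very large convolutions can run a check of this kind up front to see why a graph is rejected instead of discovering it at build time.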
Bug Fixes
- Fixed an issue in the reshape operator when called with 1D tensors.
- Fixed missing `square_alpha` scaling in dgeglu and dswiglu.
- Fixed a race condition in lazy variant-pack-template preparation observed in some single-threaded scenarios.
New Samples
- Added new samples for [memory-bound fusions](samples/cpp/membound/boolean_fusion.cpp).
Acknowledgements
The Native Sparse Attention forward-prop kernels, supporting head dim = 128 and optimized for the Blackwell architecture, were implemented in CuTe-DSL.
These kernels were a collaborative effort, jointly developed by: Jie Feng, Akash Mehra, Vincent Zhang, Dominik Ernst, Xinbo Zhao, Aditya Vavre, Vedaanta Agarwalla, Mingyang Wang, Anerudhan Gopal, Paul Springer, Yang Xu, and Nima Tajbakhsh.