NVIDIA/cudnn-frontend v1.24.0

v1.24.0 release

Repository: NVIDIA/cudnn-frontend

Tag: v1.24.0

Published: 2026-05-20T05:08:25Z

Prerelease: no

Release notes:

cuDNN Frontend v1.24.0 Release Notes

cuDNN Frontend v1.24.0 is the recommended version for cuDNN 9.22.0 and later releases.
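
The minimum cuDNN versions quoted in these notes can be checked with a small helper. This is a minimal sketch in plain Python: the feature keys and function names are hypothetical, the installed version is passed in as a dotted string rather than queried from the library, and the 9.24.0 figure for RoPE fusion is treated as a lower bound, which is an assumption.

```python
def parse_version(text):
    """Turn a dotted version such as '9.22.0' into the tuple (9, 22, 0)."""
    return tuple(int(part) for part in text.split("."))


# Minimum cuDNN versions as stated in these release notes.
# The dictionary keys are illustrative labels, not cuDNN identifiers.
MIN_CUDNN = {
    "frontend_v1.24.0_recommended": "9.22.0",
    "sdpa_backward_d256": "9.23.0",
    "rope_sdpa_fusion": "9.24.0",  # assumed to mean "9.24.0 or later"
}


def is_supported(feature, installed):
    """Return True when the installed cuDNN version meets the feature's minimum."""
    return parse_version(installed) >= parse_version(MIN_CUDNN[feature])
```

Comparing tuples rather than raw strings keeps `9.100.0` ordered after `9.22.0`.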

General Improvements 🚀 🚀

Updates to Graph API

  • Rotary Position Embedding (RoPE) is now available as a cuDNN operation, usable both standalone and as a preprocessing stage for the SDPA engine. See the [sample](test/python/test_oss_rope.py) for usage. RoPE fusion with SDPA requires cuDNN 9.24.0.
  • SDPA backward now supports hidden dimension d=256. Requires cuDNN 9.23.0 or later.
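
For readers unfamiliar with RoPE, the underlying math can be sketched in plain Python. This is a reference illustration only, not the cuDNN operation or its API: the function name `rope`, the `base` default of 10000, and the half-split pairing (element `i` rotates with element `i + head_dim // 2`) are assumptions, and the layout used by cuDNN may differ.

```python
import math


def rope(x, base=10000.0):
    """Apply rotary position embedding.

    x is a list of rows, one per position, each a list of head_dim floats.
    Pair (i, i + head_dim // 2) at position p is rotated by the angle
    p * base ** (-i / (head_dim // 2)).
    """
    out = []
    for pos, row in enumerate(x):
        head_dim = len(row)
        if head_dim % 2:
            raise ValueError("head_dim must be even")
        half = head_dim // 2
        rotated = [0.0] * head_dim
        for i in range(half):
            angle = pos * base ** (-i / half)
            c, s = math.cos(angle), math.sin(angle)
            a, b = row[i], row[i + half]
            # Standard 2D rotation of the pair (a, b).
            rotated[i] = a * c - b * s
            rotated[i + half] = a * s + b * c
        out.append(rotated)
    return out
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query at position m and a rotated key at position n depends only on m - n, which is the property attention relies on.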

Open-Source Kernels 🚀 🚀

  • Introduced a DSA module featuring the following DSA/CSA kernels for DsV4:
      • Indexer Forward: CuTe-DSL score kernel (Q @ Kᵗ, ReLU, head reduce, ratio causal mask). Non-fused; pair with Indexer Top-K for the top-K stage.
      • Indexer Top-K: SM100 CuTe-DSL radix top-K kernel with per-row `seq_lens`.
      • Sparse Attention Backward: DSA backward (FlashMLA-shape, SM90/SM100).
      • Sparse Indexer / Attention Score Recompute: Sparse (top-K) recomputation of indexer and attention scores for training loss.
      • Dense Indexer / Attention Score Recompute: Dense (full-KV) analogues of the above.
      • Indexer Backward: Three-stage pipeline (score-grad, three GEMMs, dtype cast) for sparse top-K score tensors.
      • Dense Indexer Backward: Full-KV counterpart of Indexer Backward.
  • Grouped GEMM GLU forward kernel with fused Hadamard transform.
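
The two-stage indexer flow described above (score, then top-K) can be illustrated with a plain Python reference. This is a sketch under several assumptions, not the CuTe-DSL kernels: all function and parameter names are hypothetical, the head reduce is taken to be an unweighted sum, the ratio causal mask is guessed to mean that key `j` summarises `ratio` consecutive tokens and becomes visible to query `t` once `(j + 1) * ratio - 1 <= t`, and a sort stands in for the radix selection.

```python
NEG_INF = float("-inf")


def indexer_scores(q, k, ratio=1):
    """Score every (query, key) pair.

    q[t][h] is the query vector for token t and head h; k[j] is a key vector.
    scores[t][j] = sum over heads of ReLU(q[t][h] . k[j]); masked entries
    are set to -inf.
    """
    scores = []
    for t, heads in enumerate(q):
        row = []
        for j, key in enumerate(k):
            if (j + 1) * ratio - 1 > t:  # assumed form of the ratio causal mask
                row.append(NEG_INF)
                continue
            total = 0.0
            for head in heads:
                dot = sum(a * b for a, b in zip(head, key))
                total += max(dot, 0.0)  # ReLU, then reduce over heads
            row.append(total)
        scores.append(row)
    return scores


def topk_indices(scores, seq_lens, top_k):
    """Per row, pick the indices of the top_k largest unmasked scores among
    the first seq_lens[r] columns. Rows with fewer candidates are padded
    with -1. Ties keep the lower index first."""
    result = []
    for row, valid in zip(scores, seq_lens):
        limit = min(valid, len(row))
        candidates = [j for j in range(limit) if row[j] != NEG_INF]
        candidates.sort(key=lambda j: row[j], reverse=True)  # sort, not radix select
        picked = candidates[:top_k]
        result.append(picked + [-1] * (top_k - len(picked)))
    return result
```

Keeping the two stages as separate functions mirrors the non-fused pairing named in the notes: the score stage produces a dense score matrix, and the top-K stage consumes it together with the per-row lengths.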

Skills

  • Added a new Claude skill for converting CuTe-DSL kernels into experimental cuDNN APIs.

Enhancements

  • Noisy logging messages are now emitted only once per process.
  • Convolution problems are now rejected when total filter size exceeds INT32_MAX.
  • Support for ragged input order has been added for grouped GEMM weight gradients.
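
The new convolution guard can be mirrored on the caller side before building a graph. The sketch below assumes that "total filter size" means the number of filter elements (the product of the filter dimensions); the function name is hypothetical and the check performed inside cuDNN Frontend may be defined differently.

```python
import math

INT32_MAX = 2**31 - 1


def filter_fits_int32(filter_dims):
    """Return True when the filter's element count is at most INT32_MAX."""
    # Python integers do not overflow, so the product is exact.
    return math.prod(filter_dims) <= INT32_MAX
```

For example, a filter of shape `[65536, 32768, 1, 1]` has exactly 2^31 elements, one more than INT32_MAX, so this check reports that it does not fit.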

Bug Fixes

  • Fixed an issue in the reshape operator when called with 1D tensors.
  • Fixed missing square_alpha scaling in dgeglu and dswiglu.
  • Fixed a race condition in lazy variant-pack-template preparation observed in some single-threaded scenarios.

New Samples

  • Added new samples for [memory-bound fusions](samples/cpp/membound/boolean_fusion.cpp).

Acknowledgements

The Native Sparse Attention forward-prop kernels, which support a head dimension of 128 and are optimized for the Blackwell architecture, were implemented in CuTe-DSL.

These kernels were a collaborative effort, jointly developed by: Jie Feng, Akash Mehra, Vincent Zhang, Dominik Ernst, Xinbo Zhao, Aditya Vavre, Vedaanta Agarwalla, Mingyang Wang, Anerudhan Gopal, Paul Springer, Yang Xu, and Nima Tajbakhsh.
