NVIDIA/cudnn-frontend v1.22.0
v1.22.0-release
Repository: NVIDIA/cudnn-frontend
Tag: v1.22.0
Published: 2026-04-03T02:24:29Z
Prerelease: no
Release notes:
cuDNN Frontend v1.22.0 Release Notes
cuDNN Frontend v1.22.0 is the recommended version for cuDNN 9.20.0 and later releases.
General Improvements 🚀 🚀
- Introducing a PyTorch custom operator that wraps cuDNN's Scaled Dot-Product Attention (SDPA). It exposes `scaled_dot_product_attention` as the public entry point, closely matching the signature of `torch.nn.functional.scaled_dot_product_attention`.
    def scaled_dot_product_attention(
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attn_mask: Optional[torch.Tensor] = None,
        dropout_p: float = 0.0,
        is_causal: bool = False,
        scale: Optional[float] = None,
        enable_gqa: bool = False,
        *,
        diagonal_alignment: int = 0,
        left_bound: int = -1,
        right_bound: int = -1,
        seq_len_q: Optional[torch.Tensor] = None,
        seq_len_kv: Optional[torch.Tensor] = None,
        cumulative_seq_len_q: Optional[torch.Tensor] = None,
        cumulative_seq_len_kv: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
- Introduce a preindexed execute method that reduces CPU execution overhead.
- Improve the reproducer tool to report and reproduce SDPA failures for fp8 data types as well.
- 🕒 We will be rolling out new native custom torch ops in upcoming releases – stay tuned! 😃
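
The `seq_len_*` and `cumulative_seq_len_*` arguments in the signature above point to support for variable-length batches. The following is a minimal sketch of how per-sequence lengths map to cumulative offsets. It assumes the cumulative tensors follow the convention common to other variable-length attention APIs (a leading zero followed by running totals), which these notes do not confirm. The helper name `cumulative_seq_lens` is hypothetical and is not part of cuDNN Frontend.

```python
from itertools import accumulate
from typing import List


def cumulative_seq_lens(seq_lens: List[int]) -> List[int]:
    """Return prefix sums with a leading zero: [0, l0, l0 + l1, ...].

    Assumption: this mirrors the usual packed-sequence offset convention;
    the exact layout expected by the cuDNN op is not stated in the notes.
    """
    if any(n < 0 for n in seq_lens):
        raise ValueError("sequence lengths must be non-negative")
    return list(accumulate(seq_lens, initial=0))


lens = [3, 5, 2]
offsets = cumulative_seq_lens(lens)
# Under the assumed convention, sequence i occupies token rows
# offsets[i]:offsets[i + 1] of the packed tensor.
spans = list(zip(offsets[:-1], offsets[1:]))
```

In practice the resulting list would be converted to an integer tensor on the target device before being passed to the operator; the required dtype and shape should be taken from the cuDNN Frontend documentation.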
Open-Source Kernels 🚀 🚀
- Blackwell SDPA bprop kernel supporting head dim = 256, written in cuteDSL. It is available through the torch op above or callable as a standalone API. See [samples](test/python/fe_api/test_sdpa_bwd.py) for the API usage. Requires `nvidia-cutlass-dsl[cu13]==4.4.1`.
- Grouped GEMM + quantize kernels now support dynamic shape and layout. This behavior is controlled via an environment toggle.
- Grouped GEMM + GLU/SwiGLU now support optional bias fusion in both dense and discrete modes, including partial‑N support and optional bias‑gradient generation for discrete backward paths.
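
Because the head dim = 256 kernel pins an exact `nvidia-cutlass-dsl` version, it can help to check the environment before running the sample. The sketch below uses only the standard library. The helper name `pin_status` is hypothetical, and the plain string comparison is a simplifying assumption (it would report a mismatch for a build that carries a local version suffix).

```python
from importlib.metadata import PackageNotFoundError, version


def pin_status(package: str, required: str) -> str:
    """Return 'ok', 'mismatch', or 'missing' for an exact version pin."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return "missing"
    # Simplification: exact string equality, no PEP 440 normalization.
    return "ok" if installed == required else "mismatch"


status = pin_status("nvidia-cutlass-dsl", "4.4.1")
if status != "ok":
    print(
        f"nvidia-cutlass-dsl 4.4.1 is {status}; install with: "
        'pip install "nvidia-cutlass-dsl[cu13]==4.4.1"'
    )
```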
Updates:
- The fp8 data type with packed variable-length sequences (THD) is no longer supported on the SM90 (Hopper) architecture.
- Fixed an issue where SDPA fp8 failed when used with CUDA Toolkit 12.9.
Acknowledgements:
The Blackwell SDPA bprop kernel supporting head dim = 256, written in cuteDSL, was jointly developed by Shengbin Di, Yuxi Chi, and Linfeng Zheng in close collaboration with Alibaba. We would like to extend special thanks to the core contributors from Alibaba: Siyu Wang, Haoyan Huang, Lanbo Li, Yun Zhong, Man Yuan, Minmin Sun, Yong Li, and Wei Lin for their significant contributions to this work.