NVIDIA/cudnn-frontend v1.22.1
v1.22.1-release
Repository: NVIDIA/cudnn-frontend
Tag: v1.22.1
Published: 2026-04-10T17:29:31Z
Prerelease: no
Release notes: cuDNN Frontend v1.22.1 is the recommended version for cuDNN 9.20.0 and later releases.
General Improvements 🚀 🚀
- Introducing a PyTorch custom operator that wraps cuDNN's MoE grouped GEMM operation.
      def moe_grouped_matmul(
          token: torch.Tensor,
          weight: torch.Tensor,
          first_token_offset: torch.Tensor,
          token_index: Optional[torch.Tensor] = None,
          token_ks: Optional[torch.Tensor] = None,
          mode: str = "none",
          top_k: int = 1,
      ) -> torch.Tensor
See [test/python/test_moe_grouped_matmul_op.py](test/python/test_moe_grouped_matmul_op.py) for usage.
- 🕒 We will be rolling out new native custom torch ops in upcoming releases – stay tuned! 😃
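
As a rough illustration of what a grouped GEMM computes in an MoE layer, the sketch below is a dependency-free reference written for these notes. It is not the cuDNN kernel and does not call `moe_grouped_matmul`. The function name `grouped_matmul_reference` is hypothetical, and the offset layout (one starting row per expert plus a final entry equal to the token count, with tokens pre-sorted by expert) is an assumption; the exact meaning of `first_token_offset`, `token_index`, `token_ks`, `mode`, and `top_k` should be taken from the test file referenced above.

```python
# Illustrative reference only: NOT the cuDNN implementation or its API.
# Assumption: tokens are already sorted so each expert's rows are contiguous,
# and offsets[e] is the row index of expert e's first token, with one final
# entry equal to the total number of tokens.

def grouped_matmul_reference(tokens, weights, offsets):
    """tokens: [T][K], weights: [E][K][N], offsets: length E + 1."""
    out = []
    for e, w in enumerate(weights):
        start, end = offsets[e], offsets[e + 1]
        n_cols = len(w[0])
        # Every token in this group is multiplied by the same expert weight.
        for row in tokens[start:end]:
            out.append([
                sum(row[k] * w[k][n] for k in range(len(row)))
                for n in range(n_cols)
            ])
    return out


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2
]
offsets = [0, 2, 3]  # expert 0 owns rows 0-1, expert 1 owns row 2
result = grouped_matmul_reference(tokens, weights, offsets)
```

A fused grouped GEMM performs all of these per-expert products in a single kernel launch rather than looping over experts as this sketch does.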
Open-Source Kernels 🚀 🚀
- Blackwell sdpa fprop kernel supporting head dim = 256, written in cuteDSL. Available through the torch-op above or callable as a standalone API. See [samples](test/python/fe_api/test_sdpa_fwd.py) for the API usage. Requires nvidia-cutlass-dsl[cu13]==4.4.1
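
Because the kernel depends on an exact pin, it can help to check the installed version before running the samples. The following is a minimal sketch using only the Python standard library; the helper name `check_pin` is hypothetical and not part of cuDNN Frontend. It compares the installed distribution version against the pinned one and does not verify that the `cu13` extra was selected at install time.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(dist_name, expected):
    """Return (ok, installed); installed is None when the package is absent."""
    try:
        installed = version(dist_name)
    except PackageNotFoundError:
        return False, None
    return installed == expected, installed


# Pin taken from the release note above.
ok, installed = check_pin("nvidia-cutlass-dsl", "4.4.1")
if not ok:
    print(f"expected nvidia-cutlass-dsl==4.4.1, found {installed}")
```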
Updates:
GroupedGemmWgradSm100 and grouped_gemm_wgrad_wrapper_sm100 expose the grouped GEMM weight-gradient kernel. See grouped_gemm_wgrad.html for the API reference and [moe_blockscaled_grouped_gemm_wgrad.py](python/cudnn/grouped_gemm/grouped_gemm_wgrad/moe_blockscaled_grouped_gemm_wgrad.py) for samples.
Acknowledgements:
The Blackwell sdpa fprop kernel supporting head dim = 256, written in cuteDSL, was jointly developed by Shengbin Di, Yuxi Chi, and Linfeng Zheng in close collaboration with Alibaba. We would like to extend special thanks to the core contributors from Alibaba: Siyu Wang, Haoyan Huang, Lanbo Li, Yun Zhong, Man Yuan, Minmin Sun, Yong Li, and Wei Lin for their significant contributions to this work.