Release · NVIDIA · published Mar 17, 2026

NVIDIA/cutlass v4.4.2

CUTLASS 4.4.2

Repository: NVIDIA/cutlass

Tag: v4.4.2

Published: 2026-03-17T14:55:49Z

Prerelease: no

Release notes:

CuTe DSL

  • New features
    • CuTe DSL now supports Python 3.14 on both x86_64 and aarch64.
    • Runtime Pointer/Tensor/FakeTensor now support __cache_key__, which provides a stable, hashable representation that simplifies and improves caching of compiled functions.
  • Bug fixes and improvements
    • Fixed a Hopper FMHA causal-attention performance regression on CUDA Toolkit 13.1 by optimizing mbarrier synchronization to avoid unnecessary convergence barriers.
    • Fixed a kernel-loading race condition when multiple GPUs are present in the same process in JAX.

CUTLASS C++

  • Enabled Blackwell SM120f compilation of examples and exposed NVFP4/MX Grouped GEMM in the CUTLASS Profiler.

Notability

Score: 5.0/10

Major CUDA library update from NVIDIA.