What does this release signal mean?

NVIDIA published NVIDIA/cutlass v4.4.0 (NVIDIA/cutlass). This release signal is evidence of what shipped, changed, or was packaged for users. High-signal details: Notable library release, minor version update · CUTLASS 4.4.0 Repository: NVIDIA/cutlass Tag: v4.4.0 Published: 2026-02-26T04:01:52Z Prerelease: no Release notes: CuTe DSL * New features - CuTe DSL now supports CUDA.... onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/cutlass v4.4.0

Captured source

source ↗

GitHub/github.com/NVIDIA/cutlass

NVIDIA/cutlass v4.4.0

Source ↗

published Feb 26, 2026seen Jun 6captured Jun 11http 200method plain

CUTLASS 4.4.0

Repository: NVIDIA/cutlass

Tag: v4.4.0

Published: 2026-02-26T04:01:52Z

Prerelease: no

Release notes:

CuTe DSL

New features
CuTe DSL now supports CUDA toolkit 13.1!

+ Set up with cutlass/python/CuTeDSL/setup.sh --cu13 + Refer to https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/quick_start.html for more details

GB300 is now supported in CuTe DSL with CTK 13.1

+ Refer to SM103 batched 3xFP4 blockscaled GEMM kernel for example kernel

cute.experimental: introduce a higher-level, composable layer on top of existing CuTe DSL APIs (not a separate abstraction), which can be mixed with existing Cute DSL building blocks.

+ Fragment-free programming model: copy/dot APIs take memrefs directly instead of descriptors/fragments. + Automatic TMA descriptor generation and update insertion. + Automatic vectorization and predication for SIMT copies. + New pipeline abstraction with convenience wrappers + New Partition ops to simplify partitioning logic. + Device-side TMA descriptor allocation, initialization, and management + These examples can be found here https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/experimental

Ahead of Time (AoT) compilation is now available!

+ Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/cute/export for example usage

JAX support - you can now use CuTeDSL along with JAX

+ Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/jax for example usage

Introduced versioning support in DSL:

+ cutlass.__version__ for a string representation of DSL version + cutlass.CUDA_VERSION for a version class to tell the CUDA version used for DSL

Added CopyDsmemStoreOp to store data to distributed shared memory with explicit synchronization.
Grouped GEMM example now supports device-only problem shapes.
We allow grid carve-out without problem shapes being available on host.
Tma+LdMatrix features for loading+unpacking narrow-width types (refer to mixed_input_fmha_decode.py for example usage).
It is possible now to have customized epilogue fusion for persistent dense GEMM through a Python Epilogue Fusion Configuration (EFC) function, somewhat similar to CUTLASS C++ EVT. It also provides a PyTorch evaluator to compare the results.

More examples of authorizing peak-performance kernels
SM103 batched 3xFP4 blockscaled GEMM kernel
Mixed input FMHA decode example with support for int4 KV (int8 KV supported in 4.3)
New acc_scale grouped mixed input gemm kernel variant is introduced to deliver better performance for decoding cases.
All mixed_input_gemm examples are moved into a separate folder mixed_input_gemm. Common utility functions are also extracted into mixed_input_host_utils.py under the same folder.

Bug fixing and improvements
Fixed an issue that both branches of if are executed
Fixed cute.printf with f-string
Fixed an indexing issue of scalar tensor
Fixed small K reference check error for cta_tile_n = 256 case with overlapping accumulator optimization in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.

API changes
Deprecate get_num_tmem_alloc_cols from blackwell_helpers.py. Use the one from tmem_allocator.py instead.
Deprecate SM100_TMEM_CAPACITY_COLUMNS and SM100_TMEM_MIN_ALLOC_COLUMNS.
LdMatrix16x16x8bOp and StMatrix16x8x8bOp now require explicit transpose=True when calling __init__, to avoid ambiguity in data transposition.
LdMatrix16x16x8bOp copy traits updated to be faithful to PTX without permutations. Permuted variant is renamed to LdMatrix16x8x8bOp.
Grouped GEMM example takes the argument --host_problem_shape_available. If the argument is provided, grid is carved out based upon the host problem shapes, otherwise, we launch maximum possible SMs.
hardware_info.get_max_active_cluster support pass in specific stream to query. Useful for green context based SM partition.
group_bulk_copy_modes in async bulk copy example is now deprecated, use group_modes directly instead.
Deprecate nvvm wrapper from using nvvm enum, use str instead.
cute.arch.calc_packed_f32x2_op default enable ftz to default disable ftz
In CuTe DSL with CTK 13.1, following APIs in cutlass.cute.arch now require string literal instead of enum as argument:

+ fence_proxy + fence_view_async_tmem_op + calc_packed_f32x2_op + warp_redux_sync + atomic_add + atomic_and + atomic_or + atomic_xor + atomic_max + atomic_min + atomic_exch + atomic_cas + store + load

Use 'Advanced control file' for mixed input gemm examples for better performance.
Advanced control file is an experimental feature of CUDA compiler. The controls file contains internal compiler settings tuned for specific kernels with a specific version of CUDA toolkit to get better GPU kernel code. More details and documentation on how to create these controls files will be provided in future CUDA toolkit release. Note: The advanced compiler control file is not expected to work for kernels that it was not tuned for. There is no compatibility guarantee, and the controls file will not work for CUDA toolkit with a different version.

CUTLASS C++

Add example 93 for Blackwell low latency generation phase GQA kernel.
Flash Decoding with cluster reduction.
Kernel design details please check Readme.
Add Blackwell SM100 State Space Decomposition (SSD) kernel in example 112.
Add Hopper SM90 State Space Decomposition (SSD) kernel in example 111.
Add example 94 for Ada FP8xFP8 -> BF16 GEMM with blockwise dequantization of input matrices in the MMA loop with FP32 accumulation.
Generate additional device/kernel/threadblock files in CUTLASS include directory that add functionality to...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

Notable library release, minor version update