NVIDIA/cutlass
C++
Description: CUDA Templates and Python DSLs for High-Performance Linear Algebra
Language: C++
License: NOASSERTION
Stars: 9878
Forks: 1903
Open issues: 650
Created: 2017-11-30T00:11:24Z
Pushed: 2026-06-09T02:12:36Z
Default branch: main
Fork: no
Archived: no
README: 
Overview
CUTLASS 4.5.2
_CUTLASS 4.5.2 - May 2026_
CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement. CUTLASS decomposes these "moving parts" into reusable, modular software components and abstractions.
Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies their use as building blocks within custom kernels and applications.
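To make the idea of hierarchical decomposition with tunable tile sizes concrete, here is a minimal CPU-only sketch in plain Python. It is not CUTLASS code and uses no CUTLASS API; the function names and the `tile_m`, `tile_n`, `tile_k` parameters are hypothetical, chosen only to illustrate how a GEMM can be split into output tiles with an inner loop over slices of the K dimension.

```python
def gemm_reference(A, B):
    """Naive GEMM on nested lists: C[i][j] = sum_k A[i][k] * B[k][j]."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def gemm_tiled(A, B, tile_m=2, tile_n=2, tile_k=2):
    """Blocked GEMM: the output is split into tile_m x tile_n tiles, and each
    tile accumulates partial products over tile_k-wide slices of K."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, tile_m):              # output tile row origin
        for n0 in range(0, N, tile_n):          # output tile column origin
            for k0 in range(0, K, tile_k):      # loop over K slices
                # Work inside one tile; min() handles ragged edge tiles.
                for i in range(m0, min(m0 + tile_m, M)):
                    for j in range(n0, min(n0 + tile_n, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Changing the tile sizes changes only the order in which work is visited, not the result, which is the property that lets tile shapes be treated as a tuning policy.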
CUTLASS has provided CUDA C++ template abstractions for high-performance linear algebra since 2017. These abstractions support a wide range of computations, including mixed-precision computations, specialized data movement (async copy), and multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16, FP32 emulation via tensor core instructions, 8-bit floating point types (e5m2 and e4m3), block-scaled data types (NVIDIA NVFP4 and the OCP standard MXFP4, MXFP6, and MXFP8), narrow integer types (4- and 8-bit signed and unsigned integers), and binary 1-bit data types (where architectures natively support them), across NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
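As a rough illustration of what the 8-bit floating point and block-scaled types encode, the sketch below decodes e4m3 and e5m2 bytes and applies a shared power-of-two scale to a block of elements. It assumes the OCP encodings as commonly described (e4m3: bias 7, no infinities, NaN at exponent and mantissa all ones; e5m2: bias 15 with IEEE-style infinities and NaNs; MX blocks sharing one E8M0 scale with bias 127). All function names are hypothetical and none of this is CUTLASS API.

```python
import math


def _decode_finite(byte, exp_bits, man_bits, bias):
    """Decode a finite sign/exponent/mantissa value from one byte."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    frac = (byte & ((1 << man_bits) - 1)) / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading one, fixed exponent of 1 - bias.
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


def decode_e4m3(byte):
    """e4m3 (assumed OCP encoding): only S.1111.111 is NaN, no infinities."""
    if (byte & 0x7F) == 0x7F:
        return math.nan
    return _decode_finite(byte, 4, 3, 7)


def decode_e5m2(byte):
    """e5m2 (assumed OCP encoding): all-ones exponent means inf or NaN."""
    exp = (byte >> 2) & 0x1F
    if exp == 0x1F:
        if byte & 0x3:
            return math.nan
        return -math.inf if byte & 0x80 else math.inf
    return _decode_finite(byte, 5, 2, 15)


def decode_mxfp8_block(scale_e8m0, element_bytes):
    """Apply one shared E8M0 scale (2**(s - 127)) to a block of e4m3 bytes."""
    if scale_e8m0 == 0xFF:          # E8M0 reserves 0xFF for NaN
        return [math.nan] * len(element_bytes)
    scale = 2.0 ** (scale_e8m0 - 127)
    return [scale * decode_e4m3(b) for b in element_bytes]
```

The block-scaled form trades per-element dynamic range for a single shared exponent per block, which is why the scale is stored separately from the narrow elements.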
To this rich ecosystem of C++ based kernel programming abstractions, CUTLASS 4 adds CUTLASS DSLs. These are Python native interfaces for writing high-performance CUDA kernels based on core CUTLASS and CuTe concepts without any performance compromises. This allows for a much smoother learning curve, orders of magnitude faster compile times, native integration with DL frameworks without writing glue code, and much more intuitive metaprogramming that does not require deep C++ expertise.
Overall we envision CUTLASS DSLs as a family of domain-specific languages (DSLs). With the release of 4.0, we are releasing the first of these in CuTe DSL. This is a low level programming model that is fully consistent with CuTe C++ abstractions — exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.
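The layout concept mentioned above can be illustrated without any CuTe API: a layout pairs a shape with strides and maps a logical coordinate to a linear offset. The sketch below is a plain-Python model under that assumption; `layout_index` and `layout_offsets` are hypothetical names, not CuTe DSL functions.

```python
from itertools import product


def layout_index(coord, stride):
    """Map a logical coordinate to a linear offset: sum(coord_i * stride_i)."""
    return sum(c * s for c, s in zip(coord, stride))


def layout_offsets(shape, stride):
    """Offsets for every coordinate of `shape`, last mode varying fastest."""
    return [layout_index(coord, stride)
            for coord in product(*(range(n) for n in shape))]


# The same 2x3 logical shape with two different memory orders.
col_major = layout_offsets((2, 3), (1, 2))   # stride 1 along rows
row_major = layout_offsets((2, 3), (3, 1))   # stride 1 along columns

# A transposed *view* of row-major 2x3 data needs no data movement:
# keep the buffer, swap the shape to (3, 2) and the strides to (1, 3).
data = ["a", "b", "c", "d", "e", "f"]
transposed = [data[i] for i in layout_offsets((3, 2), (1, 3))]
```

Because the layout is just a function from coordinates to offsets, changing the shape and strides is enough to reinterpret the same buffer, which is the kind of control the text refers to.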
CuTe DSL demonstrates optimal matrix multiply and other linear algebra operations targeting the programmable, high-throughput _Tensor Cores_ implemented by NVIDIA's Ampere, Hopper, and Blackwell architectures.
We believe it will become an indispensable tool for students, researchers, and performance engineers alike — flattening the learning curve of GPU programming, rapidly prototyping kernel designs, and bringing optimized solutions into production.
CuTe DSL is currently in public beta and will graduate out of beta by end of summer 2025.
To get started quickly, please refer to:
What's New in CUTLASS 4.5
CuTe DSL
- New features
- New Block API `block_copy()` to simplify TMA and S2T copy. With `block_copy()`, users can ignore the details of multicast and 2CTA partitioning for TMA and no longer need to invoke `tma_partition()`. Users can also remove the bulk of the S2T initialization to simplify S2T copy.
- MXF8F6F4 mixed precision support
- BlockScaled MMA now supports MXF8*MXF4 or MXF8*MXF6
- Block Scaled MMA for SM120 now works on Spark
- EFC broadcast semantics support
- EFC epilogue functions can now broadcast and remap tensor modes via `C.remap_modes[:, 0, 1]` subscript syntax (where `:` marks a broadcast dimension and integers select source mode indices). Covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations.
- Initial linter support: Improved type hints on CuTe DSL APIs to support static type checkers like MyPy
- dataclasses.dataclass is now supported for JIT compilation and cute.compile for both the plain and tvm-ffi paths
- cute.copy now supports user specified loop unrolling
- Python 3.14t is now supported with GIL enabled
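The broadcast and remap semantics described for `remap_modes` in the feature list above can be modeled in plain Python. This is one reading of the release note, not the CuTe DSL implementation: the function name `remap_modes_model`, the tuple-based spec, and the dictionary tensor representation are all hypothetical. Each entry of the spec corresponds to one output mode and is either a broadcast marker or the index of the source mode that supplies that coordinate.

```python
from itertools import product

BROADCAST = ":"  # stand-in for the `:` slot in the subscript syntax


def remap_modes_model(src, spec, out_shape):
    """Model of mode remapping on a tensor stored as {coord_tuple: value}.

    spec[i] is BROADCAST (output mode i ignores the source) or an integer
    naming the source mode whose coordinate output mode i provides.
    """
    n_src_modes = sum(1 for s in spec if s != BROADCAST)
    out = {}
    for out_coord in product(*(range(n) for n in out_shape)):
        src_coord = [0] * n_src_modes
        for out_mode, s in enumerate(spec):
            if s != BROADCAST:
                src_coord[s] = out_coord[out_mode]
        out[out_coord] = src[tuple(src_coord)]
    return out


# Scalar broadcast: a rank-0 source fills every output element.
scalar = remap_modes_model({(): 7}, (BROADCAST, BROADCAST), (2, 2))

# Row broadcast: a length-3 vector is repeated along a new leading mode.
vec = {(0,): 10, (1,): 20, (2,): 30}
rows = remap_modes_model(vec, (BROADCAST, 0), (2, 3))

# Transpose: output mode 0 reads source mode 1 and vice versa.
mat = {(i, j): 10 * i + j for i in range(2) for j in range(3)}
mat_t = remap_modes_model(mat, (1, 0), (3, 2))
```

Under this reading, the subscript `[:, 0, 1]` from the release note would add a broadcast leading mode in front of a two-mode source while leaving the order of the source modes unchanged.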
- Bug fixing and improvements
- Improved source code correlation for profiling/debugging
- Fixed an aarch64 segfault issue with tvm-ffi
- Re-organization for CuTe DSL examples/tutorials for better discoverability
- Fixed the following issues:
https://github.com/NVIDIA/cutlass/issues/3219 https://github.com/NVIDIA/cutlass/issues/3218 https://github.com/NVIDIA/cutlass/issues/3212 https://github.com/NVIDIA/cutlass/issues/3210 https://github.com/NVIDIA/cutlass/issues/3208 https://github.com/NVIDIA/cutlass/issues/3201 https://github.com/NVIDIA/cutlass/issues/3227 https://github.com/NVIDIA/cutlass/issues/3240 https://github.com/NVIDIA/cutlass/issues/3241
- Fixed Jax int64 stride divisibility issue
- Fixed issues for SM120 blockscaled MMAs
- Added missing MXFP8MMAOP and MXF8F6F4MMAOP for SM120.
- More examples of authoring peak-performance kernels
- MoE examples
- A new style of grouped GEMM that aligns with torch's grouped_mm and scaled_grouped_mm interfaces.
- Expert-wise tensormap descriptor setup by a cheap helper kernel (~2us) to avoid long latency in tile switching; the kernel structure is much closer to a normal GEMM.
- Compared to torch_210_cu13, very few problems perform worse on B200.
- mxfp8_2dx3d: avg 1.29 speedup;
- mxfp8_2dx2d: avg 1.41 speedup;
- nvfp4_2dx3d: avg 1.11 speedup;
- nvfp4_2dx2d: avg 1.12 speedup (worst case 0.98)
- bf16_2dx3d: avg 1.15 speedup (worst case 0.98)
- bf16_2dx2d: avg 1.17 speedup (worst case 0.96)
- Note: Performance is measured with the torch profiler. The timing for this implementation includes the helper kernel plus the main kernel, while the timing for torch includes its setup kernel and the cutlass_cpp main kernel.
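For readers unfamiliar with the grouped GEMM problem that the MoE examples target, the sketch below models the computation on the CPU. It assumes that "2dx3d" means a 2-D activation matrix whose rows are partitioned among experts, multiplied by a 3-D stack of per-expert weight matrices, with cumulative end offsets marking the row ranges. That interpretation, the function names, and the `offs` parameter are assumptions made for illustration; this is not the kernel from the examples and not torch's implementation.

```python
def matmul(A, B):
    """Plain nested-list matmul; returns [] when A has no rows."""
    K, N = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(K)) for j in range(N)]
            for row in A]


def grouped_mm_2d_3d(A, B, offs):
    """Grouped GEMM model.

    A:    (total_rows x K) matrix; rows are grouped by expert.
    B:    list of per-expert (K x N) matrices.
    offs: cumulative end row of each group, e.g. [1, 3] means
          rows [0, 1) use B[0] and rows [1, 3) use B[1].
    """
    out = []
    start = 0
    for expert, end in enumerate(offs):
        # Each group is an independent GEMM with its own weight matrix.
        out.extend(matmul(A[start:end], B[expert]))
        start = end
    return out


A = [[1, 2], [3, 4], [5, 6]]
B = [[[1, 0], [0, 1]],      # expert 0: identity
     [[2, 0], [0, 2]]]      # expert 1: scale by 2
C = grouped_mm_2d_3d(A, B, offs=[1, 3])
```

Because the group sizes vary at run time, each group needs its own tile bookkeeping, which is the per-expert setup cost that the helper kernel described above is meant to move out of the main loop.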
- API changes
- ab_dtype is deprecated in make_trivial_tiled_mma and make_blockscaled_trivial_tiled_mma from…