What does this repo signal mean?

NVIDIA published NVIDIA/cutlass (C++). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo NVIDIA/cutlass · language C++. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

NVIDIA Repo: NVIDIA/cutlass

Captured source

GitHub: github.com/NVIDIA/cutlass

NVIDIA/cutlass repository metadata

published Nov 30, 2017 · seen 5d · captured 8h · http 200 · method plain

NVIDIA/cutlass

Description: CUDA Templates and Python DSLs for High-Performance Linear Algebra

Language: C++

License: NOASSERTION

Stars: 9878

Forks: 1903

Open issues: 650

Created: 2017-11-30T00:11:24Z

Pushed: 2026-06-09T02:12:36Z

Default branch: main

Fork: no

Archived: no

README: [Figure: Complete CUDA GEMM decomposition]

Overview

CUTLASS 4.5.2

_CUTLASS 4.5.2 - May 2026_

CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement. CUTLASS decomposes these "moving parts" into reusable, modular software components and abstractions.

Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies their use as building blocks within custom kernels and applications.
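The hierarchical decomposition described above can be illustrated with a small, CPU-only sketch. It is not CUTLASS code: the function name `tiled_gemm` and the tile parameters are hypothetical, and plain nested lists stand in for GPU memory. It only shows the loop structure that tiling implies: pick an output tile, then march along the K dimension in slices while accumulating.

```python
def tiled_gemm(A, B, tile_m=2, tile_n=2, tile_k=2):
    """Compute C = A @ B by visiting output tiles, then K-slices (illustrative only)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Outer loops: each (m0, n0) pair selects one output tile.
    # On a GPU, a tile like this would typically be assigned to one thread block.
    for m0 in range(0, M, tile_m):
        for n0 in range(0, N, tile_n):
            # Mainloop: step along K one slice at a time, accumulating into the tile.
            for k0 in range(0, K, tile_k):
                for i in range(m0, min(m0 + tile_m, M)):
                    for j in range(n0, min(n0 + tile_n, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


def reference_gemm(A, B):
    """Untiled reference used to check the tiled version."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


if __name__ == "__main__":
    A = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
    B = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 0], [1, 1, 1, 1], [2, 0, 0, 3]]
    print(tiled_gemm(A, B) == reference_gemm(A, B))
```

Changing `tile_m`, `tile_n`, or `tile_k` changes only the traversal order, not the result, which is the sense in which tile sizes act as a tunable policy.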

CUTLASS has provided CUDA C++ template abstractions for high-performance linear algebra since 2017. These abstractions support a wide range of computations, including mixed-precision computations, specialized data-movement (async copy), and multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16, FP32 emulation via tensor core instruction, 8-bit floating-point types (e5m2 and e4m3), block-scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8), narrow integer types (4- and 8-bit signed and unsigned integers), and binary 1-bit data types (where architectures natively support such data types) across NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
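To make the two 8-bit floating-point types concrete, here is a minimal decoder sketch. It assumes the bit layouts of the OCP 8-bit floating-point convention (e4m3: 4 exponent bits, 3 mantissa bits, bias 7, no infinities, one NaN pattern per sign; e5m2: 5 exponent bits, 2 mantissa bits, bias 15, IEEE-style infinities and NaNs). The source names the types but does not spell out these encodings, and the helper names are hypothetical.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit float given its field widths (assumed OCP FP8 layout)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1
    if ieee_specials and exp == exp_max:
        # e5m2: all-ones exponent encodes infinity (mantissa 0) or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_max and man == man_max:
        # e4m3: only the all-ones pattern is NaN; there is no infinity.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


if __name__ == "__main__":
    print(decode_e4m3(0x7E))  # largest finite e4m3 value: 448.0
    print(decode_e5m2(0x7B))  # largest finite e5m2 value: 57344.0
```

The trade-off is visible in the maxima: e4m3 spends a bit on precision and tops out at 448, while e5m2 spends it on range and reaches 57344.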

To this rich ecosystem of C++ based kernel programming abstractions, CUTLASS 4 adds CUTLASS DSLs. These are Python native interfaces for writing high-performance CUDA kernels based on core CUTLASS and CuTe concepts without any performance compromises. This allows for a much smoother learning curve, orders of magnitude faster compile times, native integration with DL frameworks without writing glue code, and much more intuitive metaprogramming that does not require deep C++ expertise.

Overall we envision CUTLASS DSLs as a family of domain-specific languages (DSLs). With the release of 4.0, we are releasing the first of these in CuTe DSL. This is a low level programming model that is fully consistent with CuTe C++ abstractions — exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.

CuTe DSL demonstrates optimal matrix multiply and other linear algebra operations targeting the programmable, high-throughput _Tensor Cores_ implemented by NVIDIA's Ampere, Hopper, and Blackwell architectures.

We believe it will become an indispensable tool for students, researchers, and performance engineers alike — flattening the learning curve of GPU programming, rapidly prototyping kernel designs, and bringing optimized solutions into production.

CuTe DSL is currently in public beta and will graduate out of beta by end of summer 2025.

To get started quickly, please refer to the links in the source README.

What's New in CUTLASS 4.5

CuTe DSL

New features
New Block API block_copy() to simplify TMA and S2T copies. With block_copy(), users can ignore the details of multicast and 2CTA partitioning for TMA and need not invoke tma_partition(). Users can also remove the bulk of the S2T initialization code.
MXF8F6F4 mixed precision support
BlockScaled MMA now supports MXF8*MXF4 or MXF8*MXF6
Block Scaled MMA for SM120 now works on Spark
EFC broadcast semantics support
EFC epilogue functions can now broadcast and remap tensor modes via C.remap_modes[:, 0, 1] subscript syntax (where : marks a broadcast dimension and integers select source mode indices). Covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations.
Initial linter support: Improved type hints on CuTe DSL APIs to support static type checkers like MyPy
dataclasses.dataclass is now supported for JIT compilation and cute.compile on both the plain and tvm-ffi paths
cute.copy now supports user-specified loop unrolling
Python 3.14t is now supported with GIL enabled
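The broadcast and mode-remapping semantics described for EFC epilogues can be sketched in plain Python. This is one reading of the subscript rule quoted above, not the CuTe DSL API: each position in the spec is taken to be an output mode, `None` stands in for `:` (a broadcast dimension), and an integer names the source mode that feeds that output mode. The names `remap_read` and `materialize` are hypothetical.

```python
BROADCAST = None  # stands in for ":" in the subscript syntax


def remap_read(src, spec, out_index):
    """Return the source element that output coordinate `out_index` maps to."""
    # Assumption: every source mode appears exactly once in the spec.
    src_rank = sum(1 for mode in spec if mode is not BROADCAST)
    src_index = [0] * src_rank
    for out_pos, mode in enumerate(spec):
        if mode is not BROADCAST:
            # This output mode drives source mode `mode`.
            src_index[mode] = out_index[out_pos]
    # Broadcast positions are ignored: every coordinate along them reads the same element.
    elem = src
    for i in src_index:
        elem = elem[i]
    return elem


def materialize(src, spec, out_shape):
    """Expand the remapped view into nested lists (for checking only)."""
    def build(prefix):
        if len(prefix) == len(out_shape):
            return remap_read(src, spec, tuple(prefix))
        return [build(prefix + [i]) for i in range(out_shape[len(prefix)])]
    return build([])


if __name__ == "__main__":
    mat = [[1, 2, 3], [4, 5, 6]]
    print(materialize([10, 20, 30], (BROADCAST, 0), (2, 3)))  # row broadcast
    print(materialize(mat, (1, 0), (3, 2)))                   # transpose
    print(materialize(mat, (BROADCAST, 0, 1), (2, 2, 3)))     # analogue of [:, 0, 1]
```

Under this reading, scalar broadcast is a spec made only of broadcast positions, row and column broadcast mix one broadcast position with one source mode, and a transpose is a spec with no broadcast positions at all.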

Bug fixing and improvements
Improved source code correlation for profiling/debugging
Fixed an aarch64 segfault issue with tvm-ffi
Reorganized CuTe DSL examples/tutorials for better discoverability
Fixed the following issues:

https://github.com/NVIDIA/cutlass/issues/3219
https://github.com/NVIDIA/cutlass/issues/3218
https://github.com/NVIDIA/cutlass/issues/3212
https://github.com/NVIDIA/cutlass/issues/3210
https://github.com/NVIDIA/cutlass/issues/3208
https://github.com/NVIDIA/cutlass/issues/3201
https://github.com/NVIDIA/cutlass/issues/3227
https://github.com/NVIDIA/cutlass/issues/3240
https://github.com/NVIDIA/cutlass/issues/3241

Fixed JAX int64 stride divisibility issue
Fixed issues for SM120 blockscaled MMAs
Added missing MXFP8MMAOP and MXF8F6F4MMAOP for SM120.
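Several items above concern block-scaled MMAs. As background, the sketch below shows the idea of block scaling for an MXFP8-style dot product: a group of elements shares one power-of-two scale, and the scales are applied once per block after the inner products. It assumes the OCP microscaling convention of 32 elements per block with an exponent-only (E8M0) scale and e4m3 elements; rounding the elements to FP8 is deliberately omitted. All names are hypothetical and none of this is CUTLASS code.

```python
import math

BLOCK = 32      # assumed MX block size: 32 elements share one scale
E4M3_EMAX = 8   # exponent of e4m3's top binade (largest finite value 448 = 1.75 * 2**8)


def block_scale_exponent(block):
    """Choose a power-of-two exponent that moves the block's largest magnitude
    into the range [256, 512). A real conversion would saturate anything above 448."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0
    _, e = math.frexp(amax)      # amax = m * 2**e with 0.5 <= m < 1
    return (e - 1) - E4M3_EMAX   # floor(log2(amax)) - emax


def quantize_blocks(values):
    """Return (per-block scale exponents, per-block scaled elements)."""
    exps, elems = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        e = block_scale_exponent(block)
        exps.append(e)
        # Dividing by a power of two is exact; FP8 rounding is left out of this sketch.
        elems.append([v / 2.0 ** e for v in block])
    return exps, elems


def block_scaled_dot(a, b):
    """Dot product that multiplies scaled elements, then applies both block scales."""
    ea, xa = quantize_blocks(a)
    eb, xb = quantize_blocks(b)
    total = 0.0
    for sa, sb, blk_a, blk_b in zip(ea, eb, xa, xb):
        partial = sum(x * y for x, y in zip(blk_a, blk_b))
        total += partial * 2.0 ** (sa + sb)
    return total


if __name__ == "__main__":
    a = [float(i) for i in range(64)]
    b = [float((i * 3) % 7 - 3) for i in range(64)]
    print(block_scaled_dot(a, b) == sum(x * y for x, y in zip(a, b)))
```

Because the scales are powers of two, applying them costs only an exponent adjustment, which is why the scaling can sit outside the per-block inner product.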

More examples of authoring peak-performance kernels
MoE examples
A new style of grouped GEMM that aligns with torch's grouped_mm and scaled_grouped_mm interface.
Expert-wise tensormap descriptor setup by a cheap helper kernel (~2us) avoids long latency in tile switching, and the kernel structure is much closer to a normal GEMM.
Compared to torch_210_cu13, very few problems show worse performance on B200.
mxfp8_2dx3d: avg 1.29x speedup
mxfp8_2dx2d: avg 1.41x speedup
nvfp4_2dx3d: avg 1.11x speedup
nvfp4_2dx2d: avg 1.12x speedup (worst case 0.98x)
bf16_2dx3d: avg 1.15x speedup (worst case 0.98x)
bf16_2dx2d: avg 1.17x speedup (worst case 0.96x)
Note: Performance is measured with the torch profiler. This implementation's time includes the helper kernel plus the main kernel, while torch's includes its setup kernel and the cutlass_cpp main kernel.
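For readers unfamiliar with grouped GEMM in MoE layers, the following reference sketch shows what a "2dx3d" problem is assumed to compute: a 2D activation matrix whose rows are grouped by expert is multiplied by a 3D stack of per-expert weight matrices. The offsets convention (each entry is the end row of its group) is modeled on torch's grouped_mm interface and is an assumption here. The function names are hypothetical, and this is a correctness reference, not the CUTLASS kernel.

```python
def matmul(A, B):
    """Plain dense matrix multiply on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def grouped_mm_2d_3d(A, B, offs):
    """A: (M_total, K) rows grouped by expert; B: (G, K, N); offs[g]: end row of group g.
    Returns an (M_total, N) result where each row used its own expert's weights."""
    out = []
    start = 0
    for g, end in enumerate(offs):
        # Rows start..end-1 belong to expert g; an empty group contributes no rows.
        out.extend(matmul(A[start:end], B[g]))
        start = end
    return out


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    B = [
        [[1, 0, 0], [0, 1, 0]],  # expert 0 weights (K=2, N=3)
        [[2, 0, 1], [0, 2, 1]],  # expert 1 weights
    ]
    print(grouped_mm_2d_3d(A, B, offs=[2, 5]))
```

Each group is an independent GEMM with its own row count, which is why a GPU implementation must switch problem shapes and descriptors between tiles.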

API changes
ab_dtype is deprecated in make_trivial_tiled_mma and make_blockscaled_trivial_tiled_mma from…

Excerpt shown — open the source for the full document.