friendliai/cutlass
forked from NVIDIA/cutlass
Description: CUDA Templates for Linear Algebra Subroutines
License: NOASSERTION
Stars: 0
Forks: 0
Open issues: 0
Created: 2025-07-08T10:38:38Z
Pushed: 2025-07-06T12:49:22Z
Default branch: main
Fork: yes
Parent repository: NVIDIA/cutlass
Archived: no
README: 
Overview
CUTLASS 4.1.0
_CUTLASS 4.1.0 - July 2025_
CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement. CUTLASS decomposes these "moving parts" into reusable, modular software components and abstractions.
Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies their use as building blocks within custom kernels and applications.
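As a rough illustration of the tiling idea described above, the following pure-Python sketch computes a GEMM one output tile at a time and marches along the K dimension in tile-sized steps. It is not CUTLASS code: `tiled_gemm` and the `tile_m`, `tile_n`, `tile_k` parameters are hypothetical names chosen for this example, and a real kernel would map the outer loops onto threadblocks, warps, and threads rather than Python loops.

```python
def tiled_gemm(A, B, tile_m=2, tile_n=2, tile_k=2):
    """Compute C = A @ B tile by tile.

    A is M x K and B is K x N, both given as lists of lists.
    The tile sizes are tunable and do not change the result.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Each (m0, n0) pair selects one output tile, loosely analogous to the
    # work assigned to one threadblock in a GPU kernel.
    for m0 in range(0, M, tile_m):
        for n0 in range(0, N, tile_n):
            # Mainloop: walk along K one tile at a time and accumulate
            # partial products into the current output tile.
            for k0 in range(0, K, tile_k):
                for m in range(m0, min(m0 + tile_m, M)):
                    for n in range(n0, min(n0 + tile_n, N)):
                        acc = C[m][n]
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc += A[m][k] * B[k][n]
                        C[m][n] = acc
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
result = tiled_gemm(A, B)
```

Changing the tile sizes changes only the order in which partial products are accumulated, which is the kind of policy knob the paragraph above refers to.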
CUTLASS has provided CUDA C++ template abstractions for high-performance linear algebra since 2017. These abstractions offer extensive support for a wide range of computations, including mixed-precision computations, specialized data movement (async copy), and multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16, FP32 emulation via tensor core instructions, 8b floating point types (e5m2 and e4m3), block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8), narrow integer types (4 and 8b signed and unsigned integers), and binary 1b data types (where architectures natively support them). Support spans NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
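To make the two 8b floating point layouts concrete, here is a small decoding sketch. It assumes the encodings of the OCP 8-bit floating point specification (e5m2 with IEEE-style infinities and NaNs, e4m3 with no infinities and a single NaN mantissa pattern); it is not taken from CUTLASS, and `decode_fp8`, `decode_e5m2`, and `decode_e4m3` are hypothetical helper names.

```python
def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit float (sign | exponent | mantissa) to a Python float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if ieee_specials and exp == exp_max:
        # e5m2 (assumed IEEE-like): all-ones exponent encodes inf or NaN.
        return sign * float("inf") if man == 0 else float("nan")
    if not ieee_specials and exp == exp_max and man == man_max:
        # e4m3 (assumed OCP variant): only S.1111.111 is NaN, no infinities.
        return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


max_e4m3 = decode_e4m3(0x7E)  # largest finite e4m3 value under these assumptions
max_e5m2 = decode_e5m2(0x7B)  # largest finite e5m2 value under these assumptions
```

The trade-off is visible in the two maxima: e5m2 spends bits on range, e4m3 on precision.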
To this rich ecosystem of C++ based kernel programming abstractions, CUTLASS 4 adds CUTLASS DSLs. These are Python native interfaces for writing high-performance CUDA kernels based on core CUTLASS and CuTe concepts without any performance compromises. This allows for a much smoother learning curve, orders of magnitude faster compile times, native integration with DL frameworks without writing glue code, and much more intuitive metaprogramming that does not require deep C++ expertise.
Overall we envision CUTLASS DSLs as a family of domain-specific languages (DSLs). With the release of 4.0, we are releasing the first of these in CuTe DSL. This is a low level programming model that is fully consistent with CuTe C++ abstractions — exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.
CuTe DSL demonstrates optimal matrix multiply and other linear algebra operations targeting the programmable, high-throughput _Tensor Cores_ implemented by NVIDIA's Ampere, Hopper, and Blackwell architectures.
We believe it will become an indispensable tool for students, researchers, and performance engineers alike — flattening the learning curve of GPU programming, rapidly prototyping kernel designs, and bringing optimized solutions into production.
CuTe DSL is currently in public beta and will graduate out of beta by end of summer 2025.
To get started quickly, please refer to:
What's New in CUTLASS 4.1
CuTe DSL
- More examples demonstrating how to use CuTe DSL to write peak-performance kernels
  - Blackwell Mamba2 SSD
- API updates
  - for loop
    - Python built-in `range` now always generates IR and executes at runtime
    - `cutlass.range` is advanced `range` with IR level unrolling and pipelining control
    - Deprecated `cutlass.range_dynamic`, please replace with `range` or `cutlass.range`
    - Experimental: Added `pipelining` control for compiler generated software pipeline code
  - while/if
    - `while`/`if` now by default generates IR and executes at runtime unless `cutlass.const_expr` is specified for the predicate
    - Deprecated `cutlass.dynamic_expr`, please remove it
  - Rename mbarrier functions to reduce ambiguity
  - Modify SyncObject API (`MbarrierArray`, `NamedBarrier`, `TmaStoreFence`) to match `std::barrier`
  - Change pipeline `create` function to take only keyword arguments, and make `barrier_storage` optional.
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
  - Add variable sequence length support for FMHA Backward kernel.
  - Add varlen test support to Backward runner.
  - Support empty batch sequences.
- Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical iterators/arrays.
- CuTe changes:
  - Rewrite ArithTuple and ScaledBasis for robustness and clarity.
  - Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms/TiledX.
  - Factor out `print_latex` and friends and rewrite.
  - Factor out `print_svg` and friends and rewrite.
- Support Blackwell SM100 SIMT FFMA2 kernels.
- Support residual add for implicit gemm kernels.
- Various fixes for CUTLASS C++ Python interface's EVT tracer:
  - Add verifier for sm90 to report the invalid input.
  - When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
  - Register operations of tanh, sigmoid, exp, gelu to the Python AST frontend.
  - Replace the NotImplemented Error by packing all nodes into a single topological visitor node as a fallback.
- Fix profiler bugs in exhaustive perf search.
  - Fix incorrect cluster shape output issue when doing exhaustive search.
  - Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
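The EVT tracer item above about avoiding parallel edges can be pictured with a small sketch. This is an assumption-laden illustration of the idea, not the CUTLASS implementation: `ComputeGraph`, `add_node`, `add_edge`, and the `identity_N` naming are all hypothetical. When the same (source, destination) edge is requested twice, for example when an op consumes the same input in two operand slots, the second request is routed through a fresh identity node.

```python
import itertools


class ComputeGraph:
    """Hypothetical DAG that keeps at most one edge per (src, dst) pair."""

    def __init__(self):
        self.nodes = {}      # node name -> op label
        self.edges = set()   # set of (src, dst) pairs
        self._ids = itertools.count()

    def add_node(self, name, op):
        self.nodes[name] = op

    def add_edge(self, src, dst):
        if (src, dst) not in self.edges:
            self.edges.add((src, dst))
            return
        # The edge already exists: insert an identity node so the second
        # connection becomes src -> identity -> dst instead of a parallel edge.
        ident = f"identity_{next(self._ids)}"
        self.add_node(ident, "identity")
        self.edges.add((src, ident))
        self.edges.add((ident, dst))


# Example: mul consumes x twice (x * x).
g = ComputeGraph()
g.add_node("x", "load")
g.add_node("mul", "multiply")
g.add_edge("x", "mul")
g.add_edge("x", "mul")
```

After the second call the graph holds three distinct edges and one extra identity node, so every pair of nodes is still connected by at most one edge.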
Note: CUTLASS 4.x builds are known to be broken on Windows platforms for all CUDA toolkits. The CUTLASS team is working on a fix.
See the [CHANGELOG](https://docs.nvidia.com/cutlass/CHANGELOG.html) for details of all past releases and updates.
Performance
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels, they exhibit nearly optimal utilization of peak theoretical…
Notability: 1.0/10 (routine fork of existing library)