deepinfra/cutlass
forked from NVIDIA/cutlass
Captured source
source ↗deepinfra/cutlass
Description: CUDA Templates for Linear Algebra Subroutines
Language: C++
License: NOASSERTION
Stars: 0
Forks: 0
Open issues: 0
Created: 2025-02-19T01:27:57Z
Pushed: 2025-03-15T00:21:33Z
Default branch: main
Fork: yes
Parent repository: NVIDIA/cutlass
Archived: no
README: 
CUTLASS 3.8.0
_CUTLASS 3.8.0 - January 2025_
CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN. CUTLASS decomposes these "moving parts" into reusable, modular software components abstracted by C++ template classes. Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies their use as building blocks within custom kernels and applications.
To support a wide variety of applications, CUTLASS provides extensive support for mixed-precision computations, providing specialized data-movement and multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16, [FP32 emulation via tensor core instruction](./examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm), 8b floating point types (e5m2 and e4m3), block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8), narrow integer types (4 and 8b signed and unsigned integers), and binary 1b data types (where architectures allow for the native support of such data types). CUTLASS demonstrates optimal matrix multiply operations targeting the programmable, high-throughput _Tensor Cores_ implemented by NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.
See the [Quick Start Guide](./media/docs/quickstart.md) to get started quickly.
See the [functionality docs](./media/docs/functionality.md) for a more comprehensive list of kernel level features, data types, instructions, and minimum supported by CUTLASS on each GPU architecture.
What's New in CUTLASS 3.8
CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture. For a background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
- Support for new CuTe building blocks specifically for Blackwell architecture:
- [5th generation Blackwell Tensor Core instructions (TCGen05)](./include/cute/atom/mma_traits_sm100.hpp) via CuTe MMA atoms.
- Extensions to [Tensor Memory Accelerator](./include/cute/atom/copy_traits_sm100_tma.hpp) via CuTe Copy atoms.
- Exposure of Blackwell's new tensor memory (note: distinct from TMA) as [
tmem](./include/cute/pointer.hpp#L290) across CuTe as a first class data locale. - Exposure of [
tmem->rmem,rmem->tmemandsmem->tmem data movement instructions](./include/cute/atom/copy_traits_sm100.hpp) as copy atoms in CuTe. - [
make_tmem_copy()](./include/cute/atom/copy_traits_sm100.hpp) utility method to ease creation of tiled copies for tmem copy atoms. - Support for [new variants of LDSM on Blackwell](./include/cute/atom/copy_traits_sm100.hpp) via CuTe Copy atoms.
- Support for new CUTLASS building blocks specifically for Blackwell architecture:
- Various narrow precision [FP4, FP6, and FP8](./include/cutlass/exmy_base.h) formats as well as their [block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8](./include/cutlass/float_subbyte.h)
- [Pipelines that implement Blackwell specific synchronization](./include/cutlass/pipeline/sm100_pipeline.hpp).
- [Cluster launch control API supporting preferred and fallback cluster shapes](./include/cutlass/cluster_launch.hpp).
- Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
- Tile schedulers using [Blackwell's Cluster Launch Control (CLC) feature](./media/docs/blackwell_cluster_launch_control.md) to implement dynamic persistence scheduling for [GEMMs](./include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp), and [stream-K](./include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp).
- Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
- Full support for Blackwell kernels in CUTLASS 3.x API:
- [Blackwell specific kernel layers](./include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp) that
+ Implement a new warp-specialization recipe tuned specifically for Blackwell. + Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators. + Support stream-K load balancing for all kernel types everywhere via composable scheduler support.
- Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for
- [Non-block scaled data types without support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_mma_warpspecialized.hpp)
- [Non-block scaled data types with support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_mma_array_warpspecialized.hpp)
- [Block scaled data types without support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp)
- [Block scaled data types with support for pointer array and grouped GEMM with TMA](./include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp)
- Blackwell [collective mainloop for convolution kernels](./include/cutlass/conv/collective/sm100_implicit_gemm_umma_warpspecialized.hpp) supporting non-block scaled data types for fprop, dgrad, and wgrad.
- New [GEMM](./include/cutlass/gemm/dispatch_policy.hpp), [convolution](./include/cutlass/conv/dispatch_policy.hpp), and [epilogue](./include/cutlass/epilogue/dispatch_policy.hpp) dispatch policies for collectives, kernel layers, and builders.
- [Blackwell epilogue that supports loading accumulators from…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10Routine fork of existing library