microsoft/subcuber
Language: C++
License: MIT
Stars: 0
Forks: 0
Open issues: 1
Created: 2026-05-30T00:31:50Z
Pushed: 2026-06-22T03:27:47Z
Default branch: main
Fork: no
Archived: no
README:

SubCuber
----------

SubCuber is a compiler that translates Strassen-like algorithms into fast CUDA code. This repository contains the source code, CUDA kernels, tests, examples, and benchmark evaluation scripts.
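For readers unfamiliar with the term, the classic Strassen scheme multiplies two 2x2 matrices with 7 multiplications instead of 8. The sketch below shows that textbook scheme on scalar entries; in a real GEMM each entry would be a matrix block. It is only an illustration of the algorithm family: the names `Mat2`, `strassen_2x2`, and `naive_2x2` are hypothetical and do not come from the SubCuber sources, and this is not how the generated CUDA kernels are organized.

```cpp
#include <array>

// 2x2 matrix stored row-major: {x11, x12, x21, x22}.
// Hypothetical type for illustration only; not part of SubCuber.
using Mat2 = std::array<double, 4>;

// Textbook Strassen: 7 multiplications (m1..m7) instead of 8.
Mat2 strassen_2x2(const Mat2& a, const Mat2& b) {
    const double m1 = (a[0] + a[3]) * (b[0] + b[3]);
    const double m2 = (a[2] + a[3]) * b[0];
    const double m3 = a[0] * (b[1] - b[3]);
    const double m4 = a[3] * (b[2] - b[0]);
    const double m5 = (a[0] + a[1]) * b[3];
    const double m6 = (a[2] - a[0]) * (b[0] + b[1]);
    const double m7 = (a[1] - a[3]) * (b[2] + b[3]);
    return {m1 + m4 - m5 + m7,   // c11
            m3 + m5,             // c12
            m2 + m4,             // c21
            m1 - m2 + m3 + m6};  // c22
}

// Reference product with the usual 8 multiplications.
Mat2 naive_2x2(const Mat2& a, const Mat2& b) {
    return {a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
            a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]};
}
```

Applying the same scheme recursively to matrix blocks is what gives one saved block multiplication per level, at the cost of extra block additions.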
Requirements
------------

The CUDA build expects a CUDA toolkit installation and a C++20-capable host compiler. By default, the Makefiles use:

    CUDA_HOME=/usr/local/cuda
    NVCC=/usr/local/cuda/bin/nvcc
    CXX=g++

You can override these on the command line, for example:

    make CUDA_HOME=/path/to/cuda CXX=/path/to/g++
Submodules
----------------------

Update the git submodules and apply the CUTLASS patch:

    git submodule update --recursive
    git apply --directory cutlass/ cutlass.patch

Build kernel_runner
---------------------

From the repository root, run:

    make

This builds all registered runner objects and writes the outputs under the root-level build/ directory:

    build/
    |-- kernel_runner
    `-- obj/kernel_runner/
Useful build variants:

    # Build the default kernel runner
    make all

    # Build the runner without CUDA declarations enabled in the runner objects
    make no_cuda_declarations

    # Remove root-level kernel_runner build artifacts
    make clean

The build can be tuned with Make variables:

    make SPLIT_COMPILE=8
    make PRESUM_LEVEL_2_SPLIT_COMPILE=16
    make CUDA_HOME=/usr/local/cuda-12.4
Run kernel_runner
-------------------

The runner requires the GEMM problem size, data type, GPU architecture, Strassen level, iteration count, warmup count, and number of CUDA streams.

    ./build/kernel_runner \
      --m=4096 \
      --n=4096 \
      --k=4096 \
      --dtype=f32 \
      --gpu_arch=ampere \
      --strassen_level=1 \
      --iterations=10 \
      --warmup=2 \
      --streams=7

Supported values:

    --dtype=f32|f16|fp64
    --gpu_arch=volta|ampere|hopper
    --strassen_level=0|1|2|all
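To make the Strassen level concrete, the sketch below computes what the standard 2x2 Strassen recursion implies for a given problem size: each level halves m, n, and k and multiplies the number of leaf GEMMs by 7. This is generic arithmetic for the textbook scheme, not a description of SubCuber's scheduling. The names `StrassenShape`, `strassen_shape`, and `multiply_ratio` are hypothetical, and the code assumes every dimension is divisible by 2^level, which the README does not state as a requirement.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of SubCuber.
struct StrassenShape {
    std::uint64_t sub_m, sub_n, sub_k;  // dimensions of each leaf GEMM
    std::uint64_t products;             // number of leaf GEMMs, 7^level
};

// Shape of the leaf problems after `level` rounds of 2x2 Strassen splitting.
// Assumption: m, n, and k are all divisible by 2^level.
StrassenShape strassen_shape(std::uint64_t m, std::uint64_t n,
                             std::uint64_t k, unsigned level) {
    StrassenShape s{m, n, k, 1};
    for (unsigned i = 0; i < level; ++i) {
        assert(s.sub_m % 2 == 0 && s.sub_n % 2 == 0 && s.sub_k % 2 == 0);
        s.sub_m /= 2;
        s.sub_n /= 2;
        s.sub_k /= 2;
        s.products *= 7;  // 7 block products per level instead of 8
    }
    return s;
}

// Fraction of block multiplications relative to the classical algorithm:
// (7/8)^level. Extra block additions are not counted here.
double multiply_ratio(unsigned level) {
    double ratio = 1.0;
    for (unsigned i = 0; i < level; ++i) ratio *= 7.0 / 8.0;
    return ratio;
}
```

For the 4096 x 4096 x 4096 example above, level 1 gives 7 leaf GEMMs of size 2048 and level 2 gives 49 leaf GEMMs of size 1024, while level 0 is the plain GEMM.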
Optional filtering:

    ./build/kernel_runner \
      --m=4096 --n=4096 --k=4096 \
      --dtype=f32 --gpu_arch=ampere --strassen_level=all \
      --iterations=10 --warmup=2 --streams=7 \
      --kernel_regex='presum'
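As a rough picture of what a regex filter such as `--kernel_regex='presum'` might do, the sketch below selects kernel names with `std::regex_search`, so a pattern matches anywhere in the name. Both points are assumptions: the README does not say whether the runner uses partial or full matching, and the kernel names used with it here are hypothetical, modeled on the test target names rather than taken from the runner. The function `filter_kernels` is likewise illustrative.

```cpp
#include <regex>
#include <string>
#include <vector>

// Keep only the names that contain a match for `pattern`.
// Assumption: partial-match semantics (regex_search), which may differ
// from what kernel_runner actually implements.
std::vector<std::string> filter_kernels(const std::vector<std::string>& names,
                                        const std::string& pattern) {
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::vector<std::string> selected;
    for (const auto& name : names) {
        if (std::regex_search(name, re)) {
            selected.push_back(name);
        }
    }
    return selected;
}
```

With partial matching, the pattern `presum` would keep a name like `ampere_f32_strassen_winograd_presum` and drop `ampere_f32_strassen_winograd_tile`. Run `./build/kernel_runner --help` to check the actual behavior.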
For the full usage line:

    ./build/kernel_runner --help
Build Tests
-----------

From the repository root, build all CUDA GoogleTest binaries with:

    make -C tests

The test Makefile writes test binaries directly into tests/. You can also build one test binary by naming its target:

    make -C tests test_ampere_f32_strassen_winograd_tile
    make -C tests test_hopper_f32_strassen_winograd_presum
Run Tests
---------

Run the full test suite:

    make -C tests run-tests

Run tests for one GPU family:

    make -C tests run-volta-tests
    make -C tests run-ampere-tests
    make -C tests run-hopper-tests

Run one test binary through Make:

    make -C tests run-test_ampere_f32_strassen_winograd_tile

Or run a compiled test binary directly:

    cd tests
    ./test_ampere_f32_strassen_winograd_tile

Clean test artifacts:

    make -C tests clean
Run Experiments
---------------

The repo has scripts to run the whole evaluation pipeline. See ArtifactEval.md for details.
Current Status
----------------

We are actively optimizing the implementation and extending it to larger cases. The items below are what we are currently working on. Contributions from the open source community are welcome.

1. Add Cooperative Strassen GeMM kernels for H100/H200: We currently support only Pingpong kernels, because on our H200 system they run fastest in most cases. On H100, however, Cooperative kernels are fastest. We will soon support Cooperative Strassen GeMM kernels.
2. Add GeMM kernels for Blackwell.
3. Overhaul Level 2 schedules to support a larger family of Strassen-like algorithms.