inclusionAI/cuLA
Description: CUDA kernels for linear attention variants, written in CuTe DSL and CUTLASS C++.
Language: Python
License: Apache-2.0
Stars: 519
Forks: 63
Open issues: 28
Created: 2026-04-02T07:05:32Z
Pushed: 2026-06-10T03:09:00Z
Default branch: main
Fork: no
Archived: no
README:
Introduction
Linear attention mechanisms reformulate standard attention to use linear-time state updates instead of quadratic pairwise interactions, making them well suited for long-context LLM workloads. Recent variants such as GLA, KDA, GDN, and Lightning Attention further improve expressiveness with gating, delta-style updates, and chunkwise decomposition.
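To make the "linear-time state update" idea concrete, here is a minimal pure-Python sketch of the textbook formulation, not cuLA's kernels. It compares the quadratic causal form with the equivalent recurrent form, using an optional scalar `decay` as a simple stand-in for gating. The function names and the toy inputs are illustrative assumptions.

```python
def quadratic_form(q, k, v, decay=1.0):
    """O(T^2) causal form: o_t = sum_{s<=t} decay^(t-s) * (q_t . k_s) * v_s."""
    num_steps, dv = len(q), len(v[0])
    out = []
    for t in range(num_steps):
        o_t = [0.0] * dv
        for s in range(t + 1):
            score = decay ** (t - s) * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(dv):
                o_t[j] += score * v[s][j]
        out.append(o_t)
    return out


def recurrent_form(q, k, v, decay=1.0):
    """O(T) form: S_t = decay * S_{t-1} + k_t v_t^T, then o_t = S_t^T q_t."""
    dk, dv = len(k[0]), len(v[0])
    state = [[0.0] * dv for _ in range(dk)]  # running (dk x dv) state matrix
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(dk):
            for j in range(dv):
                state[i][j] = decay * state[i][j] + k_t[i] * v_t[j]
        out.append([sum(q_t[i] * state[i][j] for i in range(dk)) for j in range(dv)])
    return out


# Toy inputs (illustrative only): 3 time steps, key dim 2, value dim 3.
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]

for decay in (1.0, 0.9):
    a = quadratic_form(q, k, v, decay)
    b = recurrent_form(q, k, v, decay)
    assert all(abs(x - y) < 1e-9 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Both forms produce the same outputs, but the recurrent one carries only a fixed-size state between steps, which is the property that makes long contexts tractable.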
cuLA provides hand-tuned CUDA implementations of these linear attention variants, targeting NVIDIA Blackwell (SM10X) and Hopper (SM90) GPUs. It is designed as a submodule of flash-linear-attention (FLA), sharing the same interface — adopting cuLA requires only a one-line import change. For ease of maintenance, cuLA is currently developed as a standalone library; the end goal is for users to seamlessly access these kernels through FLA. Since FLA already has a kernel dispatch mechanism in place, integration will be ready soon.
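Because the interface is shared, a caller could select the backend at import time. The following is a hedged sketch rather than an official pattern: `cula.kda` and `chunk_kda` appear in the Quick Start, while the `fla.ops.kda` module path and the `load_first_available` helper are assumptions made for illustration.

```python
import importlib


def load_first_available(candidates, attr):
    """Return (module_name, attribute) from the first importable module exposing `attr`.

    Returns (None, None) when no candidate can be imported.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # backend not installed; try the next one
        if hasattr(module, attr):
            return name, getattr(module, attr)
    return None, None


# Prefer cuLA, then fall back to FLA (the FLA module path is an assumption).
backend, chunk_kda = load_first_available(("cula.kda", "fla.ops.kda"), "chunk_kda")
print(f"chunk_kda backend: {backend}")
```

If neither package is installed, `backend` is `None` and the caller can raise a clear error instead of failing deep inside model code.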
> ⚠️ Early Stage: cuLA is in its early development phase. Many kernels still have significant room for optimization, and the API may evolve. We warmly welcome contributions from the community — whether it's performance tuning, new algorithm support, bug fixes, or architectural improvements. Every contribution helps push the boundaries of linear attention on modern GPUs!
Installation
cuLA supports both Hopper (SM90) and Blackwell (SM10X) GPUs.
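A quick way to see which of the two targets a machine falls under is to inspect its CUDA compute capability. The sketch below assumes that SM90 corresponds to capability 9.0 and that "SM10X" means any 10.x capability; `classify_arch` and `describe_current_gpu` are hypothetical helpers, not part of cuLA.

```python
def classify_arch(major, minor):
    """Map a CUDA compute capability to the architecture names used in this README.

    Assumption: SM90 is capability 9.0 and SM10X covers every 10.x capability.
    Returns None for anything else.
    """
    if (major, minor) == (9, 0):
        return "Hopper (SM90)"
    if major == 10:
        return "Blackwell (SM10X)"
    return None


def describe_current_gpu():
    """Classify the current GPU; None if torch or CUDA is missing, or the GPU is not a target."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability()
    return classify_arch(major, minor)


if __name__ == "__main__":
    print(describe_current_gpu() or "No supported GPU detected")
```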
> Requirements (Hopper & Blackwell): Python 3.12+, CUDA Toolkit 12.9+ (SM10X support), NVCC 12.9+, PyTorch 2.9.1+
> Note: The PyTorch CUDA version must match your system CUDA Toolkit version. Check with nvcc --version and python -c "import torch; print(torch.version.cuda)".
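The two commands in the note can be combined into one check. This is a minimal sketch under the assumption that `nvcc --version` prints a line containing `release <major>.<minor>`; the helper names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from `nvcc --version` output, or None if not found."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def toolkit_matches_torch(nvcc_release, torch_cuda):
    """True only when both versions are known and agree on major.minor."""
    if nvcc_release is None or torch_cuda is None:
        return False
    return nvcc_release.split(".")[:2] == torch_cuda.split(".")[:2]


def detect_versions():
    """Best-effort lookup; a value is None when the tool or package is unavailable."""
    nvcc_release = None
    if shutil.which("nvcc"):
        result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
        nvcc_release = parse_nvcc_release(result.stdout)
    try:
        import torch
        torch_cuda = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        torch_cuda = None
    return nvcc_release, torch_cuda


if __name__ == "__main__":
    nvcc_release, torch_cuda = detect_versions()
    print(f"nvcc: {nvcc_release}, torch CUDA: {torch_cuda}")
    print("match" if toolkit_matches_torch(nvcc_release, torch_cuda) else "mismatch or unknown")
```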
Clone cuLA & dependencies:
git clone https://github.com/inclusionAI/cuLA.git
cd cuLA
git submodule update --init --recursive
Install PyTorch:
pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu129
Install cuLA & dependencies:
# Install flash-linear-attention for benchmark repro
pip install -e third_party/flash-linear-attention
# Install cuLA
pip install -e . --no-build-isolation
Quick Start
KDA (Kimi Delta Attention) — Blackwell (SM10X)
Just change the import:
import torch
from cula.kda import chunk_kda

# Sample test output
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B1-T63-H1-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B2-T500-H3-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B2-T1000-H3-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B3-T1024-H4-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B4-T1024-H4-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B4-T2048-H8-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk_varlen[...] PASSED
...
======================= 17 passed in 40.95s =======================
CUDA kernel tuning is significantly more labor-intensive than Triton tuning, so contributions from the open-source community are warmly welcomed!

## Repository Layout

See [REPO_LAYOUT.md](REPO_LAYOUT.md) for the full directory structure and a summary of each component.

## Roadmap

* [ ] Integrate into [flash-linear-attention](https://github.com/fla-org/flash-linear-attention) via FLA's kernel dispatch mechanism.
* [ ] Polynomial approximation to mitigate the exponential bottleneck, as in [Flash-Attention-4](https://arxiv.org/abs/2603.05451).
* [ ] Larger chunk size and 2-CTA on SM10X for improved throughput.
* [ ] Continuous optimization via agentic methods such as [AVO](https://arxiv.org/abs/2603.24517).
* [ ] Support for more algorithms.
* [ ] Small B/H/S optimizations.
* [x] Support for BF16 beta input.

**Train**

* [x] Modular KDA Forward (SM10X, compatible with [Kimi CP](https://github.com/fla-org/flash-linear-attention/blob/main/fla/ops/cp/README.md))
  * [x] kda chunk intra
  * [x] chunk gated delta h
  * [x] recompute wu
  * [x] chunk fwd o
* [ ] Modular GDN Forward / Backward Kernels (compatible with [Kimi CP](https://github.com/fla-org/flash-linear-attention/blob/main/fla/ops/cp/README.md))
* [ ] Backward pass optimizations.
* [ ] Kernel-level compute-communication overlapping for CP linear attention kernels (via **nvshmem**)

**Inference**

* [x] Lightning prefill kernel (SM10X)
* [x] Lightning decode kernel (SM90 & SM10X)
* [x] Fused KDA prefill kernel (SM90)
* [ ] Fused KDA prefill kernel (SM10X)
* [ ] MTP support
* [ ] More aggressive fusion of small neighboring kernels like cumsum for inference scenarios.

## Acknowledgements

This project is inspired by [flash-linear-attention](https://github.com/fla-org/flash-linear-attention), [CUTLASS](https://github.com/NVIDIA/cutlass), [CuTe DSL](https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Flash-Attention](https://github.com/dao-ailab/flash-attention), and [FlashMLA](https://github.com/deepseek-ai/FlashMLA). We thank [FLA-org](https://github.com/fla-org) and NVIDIA for their great work.

## Citation

If you find cuLA useful, please cite it using the metadata in our [`CITATION.cff`](CITATION.cff) file:

@software{cula2026,
  title = {cuLA: CUDA Linear Attention},
  author = {Chaofan Yu, Bowen Zeng, Hao Chen, Zhe Yang, Zhiqiang Zhang, Huan Li and Jun Zhou},
  year = {2026},
  url = {https://github.com/InclusionAI/cuLA}
}

## Contact

If you're interested in an internship or job opportunity, feel free to reach out: **chaofanyu@gmail.com**

No CUDA experience is required as long as you're a quick learner.

For Q&A and discussion, you can join us through:

- **Slack:** [cuLA Slack Community](https://join.slack.com/t/cula-hq/shared_invite/zt-3uaacvm9y-xJwZyGueeKxZRYQlj7~hxw)
- **WeChat:** The WeChat group has exceeded 200 members and can no longer be joined via QR code. To join, please send your WeChat ID to any of the following emails and we'll invite you:…