What does this repo signal mean?

InclusionAI (Ant Group) published inclusionAI/cuLA (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo inclusionAI/cuLA · language Python · CUDA kernels for linear attention variants, written in CuTe DSL and CUTLASS C++. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

InclusionAI (Ant Group) Repo: inclusionAI/cuLA

Captured source

GitHub/github.com/inclusionAI/cuLA

inclusionAI/cuLA repository metadata

published Apr 2, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

inclusionAI/cuLA

Description: CUDA kernels for linear attention variants, written in CuTe DSL and CUTLASS C++.

Language: Python

License: Apache-2.0

Stars: 519

Forks: 63

Open issues: 28

Created: 2026-04-02T07:05:32Z

Pushed: 2026-06-10T03:09:00Z

Default branch: main

Fork: no

Archived: no

README:

## Introduction

Linear attention mechanisms reformulate standard attention to use linear-time state updates instead of quadratic pairwise interactions, making them well suited for long-context LLM workloads. Recent variants such as GLA, KDA, GDN, and Lightning Attention further improve expressiveness with gating, delta-style updates, and chunkwise decomposition.
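To make the "linear-time state updates instead of quadratic pairwise interactions" claim concrete, here is a toy sketch of plain causal linear attention in three equivalent forms: pairwise, recurrent, and chunkwise. It deliberately omits normalization, gating, and delta-style updates, so it is not GLA, KDA, GDN, or Lightning Attention, and it says nothing about how cuLA's kernels are organized. All function names are hypothetical.

```python
# Toy causal linear attention on plain Python lists (illustration only).
# q and k are T x d_k, v is T x d_v. No normalization, gating, or delta rule.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_form(q, k, v):
    """o_t = sum_{s<=t} (q_t . k_s) v_s, computed pairwise in O(T^2)."""
    out = []
    for t in range(len(q)):
        o = [0.0] * len(v[0])
        for s in range(t + 1):
            w = dot(q[t], k[s])
            o = [oi + w * vi for oi, vi in zip(o, v[s])]
        out.append(o)
    return out

def recurrent_form(q, k, v):
    """S_t = S_{t-1} + k_t v_t^T, then o_t = q_t^T S_t: one state update per token."""
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for qt, kt, vt in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += kt[i] * vt[j]
        out.append([sum(qt[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out

def chunkwise_form(q, k, v, chunk=2):
    """Carry the state S across chunks; handle tokens inside a chunk pairwise."""
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for start in range(0, len(q), chunk):
        end = min(start + chunk, len(q))
        for t in range(start, end):
            # Contribution of all earlier chunks, read from the carried state.
            o = [sum(q[t][i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
            # Contribution of tokens in the current chunk up to and including t.
            for s in range(start, t + 1):
                w = dot(q[t], k[s])
                o = [oi + w * vi for oi, vi in zip(o, v[s])]
            out.append(o)
        # Fold the finished chunk into the state.
        for s in range(start, end):
            for i in range(d_k):
                for j in range(d_v):
                    S[i][j] += k[s][i] * v[s][j]
    return out
```

The three functions return the same outputs for the same inputs. The chunkwise form is the one that maps well to GPUs, since the within-chunk work becomes small dense matrix products while the state is only updated once per chunk.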

cuLA provides hand-tuned CUDA implementations of these linear attention variants, targeting NVIDIA Blackwell (SM10X) and Hopper (SM90) GPUs. It is designed as a submodule of flash-linear-attention (FLA), sharing the same interface — adopting cuLA requires only a one-line import change. For ease of maintenance, cuLA is currently developed as a standalone library; the end goal is for users to seamlessly access these kernels through FLA. Since FLA already has a kernel dispatch mechanism in place, integration will be ready soon.
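Because the interface is shared, a caller could pick whichever backend is installed at import time. The sketch below is one hedged way to do that. `cula.kda.chunk_kda` is the import shown in the Quick Start; `fla.ops.kda.chunk_kda` is assumed to be the FLA counterpart and should be checked against the installed FLA version. The helper `first_available` and the list `KDA_CANDIDATES` are hypothetical names, not part of either library.

```python
import importlib

def first_available(candidates):
    """Return (module_name, attribute) for the first candidate that imports.

    `candidates` is a list of (module_name, attribute_name) pairs, tried in order.
    """
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return module_name, getattr(module, attr)
        except (ImportError, AttributeError):
            continue
    tried = [name for name, _ in candidates]
    raise ImportError(f"none of {tried} could be imported")

# Preference order: cuLA first, then FLA (the FLA path is an assumption).
KDA_CANDIDATES = [("cula.kda", "chunk_kda"), ("fla.ops.kda", "chunk_kda")]
```

Calling `backend, chunk_kda = first_available(KDA_CANDIDATES)` would then bind `chunk_kda` to the cuLA kernel when it is installed and fall back otherwise, which keeps the rest of the model code unchanged.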

> ⚠️ Early Stage: cuLA is in its early development phase. Many kernels still have significant room for optimization, and the API may evolve. We warmly welcome contributions from the community — whether it's performance tuning, new algorithm support, bug fixes, or architectural improvements. Every contribution helps push the boundaries of linear attention on modern GPUs!

## Installation

cuLA supports both Hopper (SM90) and Blackwell (SM10X) GPUs.
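A quick way to see which of the two families a machine falls into is to look at the device's compute capability. The sketch below assumes SM90 corresponds to capability 9.0 and SM10X to any 10.x device; whether cuLA accepts every 10.x part is not stated in the README. Both function names are hypothetical.

```python
def cula_arch_family(capability):
    """Map a (major, minor) compute capability to the GPU family named above.

    Returns None for anything outside the two families (assumed mapping).
    """
    major, minor = capability
    if major == 9 and minor == 0:
        return "Hopper (SM90)"
    if major == 10:
        return "Blackwell (SM10X)"
    return None

def detect_arch_family():
    """Query the current device via PyTorch; returns None without a usable GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return cula_arch_family(torch.cuda.get_device_capability())
```

`detect_arch_family()` returning `None` means either PyTorch is missing, no CUDA device is visible, or the device is outside the two families listed here.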

> Requirements (Hopper & Blackwell): Python 3.12+, CUDA Toolkit 12.9+ (SM10X support), NVCC 12.9+, PyTorch 2.9.1+

> Note: The PyTorch CUDA version must match your system CUDA Toolkit version. Check with nvcc --version and python -c "import torch; print(torch.version.cuda)".
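The two commands in the note can be combined into a small script. The sketch below treats "match" as equal major.minor versions, which is an interpretation rather than something the README spells out, and it relies on `nvcc --version` printing a line containing `release X.Y`. All function names are hypothetical.

```python
import re
import subprocess

def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    m = re.search(r"release\s+(\d+)\.(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

def parse_torch_cuda(version):
    """torch.version.cuda is a string such as '12.9', or None on CPU-only builds."""
    if not version:
        return None
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def versions_match(nvcc_text, torch_cuda):
    """True when both versions are known and agree on major.minor."""
    toolkit = parse_nvcc_release(nvcc_text)
    return toolkit is not None and toolkit == parse_torch_cuda(torch_cuda)

def check_local():
    """Run the comparison on this machine; raises if nvcc or torch is missing."""
    import torch
    out = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    return versions_match(out, torch.version.cuda)
```

Running `check_local()` before `pip install -e . --no-build-isolation` may save a failed build when the toolkit and the PyTorch wheel disagree.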

Clone cuLA & dependencies:

git clone https://github.com/inclusionAI/cuLA.git
cd cuLA
git submodule update --init --recursive

Install PyTorch:

pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu129

Install cuLA & dependencies:

# Install flash-linear-attention for benchmark repro
pip install -e third_party/flash-linear-attention

# Install cuLA
pip install -e . --no-build-isolation

## Quick Start

KDA (Kimi Delta Attention) — Blackwell (SM10X)

Just change the import:

import torch
from cula.kda import chunk_kda #

Sample test output

tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B1-T63-H1-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B2-T500-H3-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B2-T1000-H3-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B3-T1024-H4-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B4-T1024-H4-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk[B4-T2048-H8-D128-...] PASSED
tests/test_kda_e2e_compare_fla.py::test_safe_gate_chunk_varlen[...] PASSED
...
======================= 17 passed in 40.95s =======================

CUDA kernel tuning is significantly more labor-intensive than Triton — contributions from the open-source community are warmly welcomed!

## Repository Layout

See [REPO_LAYOUT.md](REPO_LAYOUT.md) for the full directory structure and a summary of each component.

## Roadmap

* [ ] Integrate into [flash-linear-attention](https://github.com/fla-org/flash-linear-attention) via FLA's kernel dispatch mechanism
* [ ] Polynomial approximation to mitigate the exponential bottleneck, as in [Flash-Attention-4](https://arxiv.org/abs/2603.05451).
* [ ] Larger chunk size and 2-CTA on SM10X for improved throughput.
* [ ] Continuous optimization via agentic methods such as [AVO](https://arxiv.org/abs/2603.24517).
* [ ] Support for more algorithms.
* [ ] Small B/H/S optimizations.
* [x] Support for BF16 beta input.
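
The polynomial-approximation item refers to replacing a hardware exponential with a few multiply-adds. The sketch below shows the generic idea only: split `x` into an integer part and a small fraction, evaluate a low-degree polynomial on the fraction, and scale by a power of two. The degree-5 Taylor coefficients are an illustrative choice; this is not the scheme used by Flash-Attention-4 or planned for cuLA, and `exp2_poly` is a hypothetical name.

```python
import math

LN2 = math.log(2.0)
# Taylor coefficients of 2**f = exp(f * ln 2) up to degree 5 (illustrative, not minimax).
COEFFS = [LN2 ** i / math.factorial(i) for i in range(6)]

def exp2_poly(x):
    """Approximate 2**x with range reduction plus a degree-5 polynomial."""
    n = math.floor(x + 0.5)      # nearest integer, so f lies in [-0.5, 0.5)
    f = x - n
    p = 0.0
    for c in reversed(COEFFS):   # Horner evaluation: only multiply-adds
        p = p * f + c
    return math.ldexp(p, n)      # p * 2**n via exponent manipulation
```

With this reduction the relative error stays below roughly 1e-5 over ordinary input ranges, which hints at why a short polynomial can stand in for the exponential when the special-function unit is the bottleneck.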

**Train**

* [x] Modular KDA Forward (SM10X, compatible with [Kimi CP](https://github.com/fla-org/flash-linear-attention/blob/main/fla/ops/cp/README.md))
  * [x] kda chunk intra
  * [x] chunk gated delta h
  * [x] recompute wu
  * [x] chunk fwd o

* [ ] Modular GDN Forward / Backward Kernels (compatible with [Kimi CP](https://github.com/fla-org/flash-linear-attention/blob/main/fla/ops/cp/README.md))

* [ ] Backward pass optimizations.

* [ ] Kernel-level compute-communication overlapping for CP linear attention kernels (via **nvshmem**)

**Inference**

* [x] Lightning prefill kernel (SM10X)

* [x] Lightning decode kernel (SM90 & SM10X)

* [x] Fused KDA prefill kernel (SM90)

* [ ] Fused KDA prefill kernel (SM10X)

* [ ] MTP support

* [ ] More aggressive fusion of small neighboring kernels like cumsum for inference scenarios.

## Acknowledgements

This project is inspired by [flash-linear-attention](https://github.com/fla-org/flash-linear-attention), [CUTLASS](https://github.com/NVIDIA/cutlass), [CuTe DSL](https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Flash-Attention](https://github.com/dao-ailab/flash-attention), and [FlashMLA](https://github.com/deepseek-ai/FlashMLA). We thank [FLA-org](https://github.com/fla-org) and NVIDIA for their great work.

## Citation

If you find cuLA useful, please cite it using the metadata in our [`CITATION.cff`](CITATION.cff) file:

@software{cula2026,
  title  = {cuLA: CUDA Linear Attention},
  author = {Chaofan Yu and Bowen Zeng and Hao Chen and Zhe Yang and Zhiqiang Zhang and Huan Li and Jun Zhou},
  year   = {2026},
  url    = {https://github.com/InclusionAI/cuLA}
}

## Contact

If you're interested in an internship or job opportunity, feel free to reach out: **chaofanyu@gmail.com**

No CUDA experience is required as long as you're a quick learner.

For Q&A and discussion, you can join us through:

- **Slack:** [cuLA Slack Community](https://join.slack.com/t/cula-hq/shared_invite/zt-3uaacvm9y-xJwZyGueeKxZRYQlj7~hxw)
- **WeChat:** The WeChat group has exceeded 200 members and can no longer be joined via QR code. To join, please send your WeChat ID to any of the following emails and we'll invite you:...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

New repo with 518 stars, decent traction