Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas
Published 4/21/2025
Authors
Austin Silveria, Soham Govande, Dan Fu
TL;DR: We present Chipmunk, a training-free method to accelerate diffusion transformers with hardware-aware dynamic sparsity. Chipmunk caches attention weights and MLP activations from previous steps and dynamically computes a sparse “delta” against the cached weights. Chipmunk achieves up to 3.7x faster video generation on HunyuanVideo at 720x1280 resolution for a 5s video, and 1.6x faster image generation on FLUX.1-dev at 1280x768 resolution. This blog is cross-posted to the Sandy Research blog at UCSD. Check out Part II and Part III on the Sandy Research blog for a deeper dive into the sparsity patterns and the kernels behind Chipmunk!
Images of cute chipmunks can be generated 1.37x faster! Left: Fully Dense FLUX.1-dev. Right: Ours (84% sparse attention and 70% sparse MLP).
Motivation: Diffusion Transformers (DiTs) have become the standard for video generation, but the time and cost of generation keep them out of reach of many applications. We raise two questions: (1) What do the model activations want to do? (2) What does the hardware want to do? We then use these insights to design hardware-friendly algorithms that maximize quality per unit of generation time.
In this post, we unpack:
- Slow-Changing, Sparse Activations: DiT activations for MLP and attention change slowly across steps, and they are naturally sparse.
- Cross-Step Deltas: Because the activations change slowly and are naturally sparse, reformulating them to compute cross-step deltas makes them even sparser.
- Hardware-Aware Sparsity Pattern: For both attention and MLP, we can pack dense shared memory tiles from non-contiguous columns in global memory. We open-source fast kernels for this!
But first, a preview of our results:
| Hunyuan | Latency | Speedup | VBench Quality | VB Semantic | VB Total | Resolution | Sparsity |
|---|---|---|---|---|---|---|---|
| FlashAttention-3 | 1030s | 1x | 85.09% | 75.82% | 83.24% | 720x1280x129 | 0% |
| Sliding Tile Attention (Training-Free) | 945s -> 527s | 1.79x | 84.63% | 73.83% | 82.46% | 768x1280x117 | 58% |
| Chipmunk (Training-Free) | 1030s -> 477s | 2.16x | 84.60% | 76.29% | 82.94% | 720x1280x129 | 82% * |
| Chipmunk + Step Caching (Training-Free) | 1030s -> 277s | 3.72x | 84.22% | 75.60% | 82.50% | 720x1280x129 | 87% |
* 93% sparsity on 44 out of 50 steps for an average of 82% sparsity.
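The speedup column and the footnote's average can be re-derived from the reported numbers. The short check below is only a sanity calculation, not part of Chipmunk. It assumes (the post does not say so explicitly) that the 6 steps outside the footnote's 44 run fully dense.

```python
# Latencies in seconds (before, after), copied from the HunyuanVideo table.
latencies = {
    "Sliding Tile Attention": (945, 527),
    "Chipmunk": (1030, 477),
    "Chipmunk + Step Caching": (1030, 277),
}
speedups = {
    name: round(before / after, 2)
    for name, (before, after) in latencies.items()
}

# Assumption: the remaining 50 - 44 = 6 steps are dense (0% sparse).
sparse_steps, total_steps, step_sparsity = 44, 50, 0.93
average_sparsity = step_sparsity * sparse_steps / total_steps

for name, value in speedups.items():
    print(f"{name}: {value}x")
print(f"average sparsity: {average_sparsity:.1%}")
```

Under that assumption the average comes out just under 82%, which matches the rounded figure in the table.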
| FLUX.1-dev* (bf16) | ImageReward | MLP Sparsity | Attn Sparsity | Speedup |
|---|---|---|---|---|
| Baseline (with FlashAttention-3) | 76.6% | 0% | 0% | 1x |
| Chipmunk | 80.2% | 70% | 83.5% | 1.37x |
| Chipmunk + Step Caching | 78.0% | 70% | 83.5% | 1.63x |
* These FLUX.1-dev numbers were evaluated on 1280x768 images. If we increase the image size to 2304x1280, we get speedups of up to 1.65x per image without stacking on top of step caching methods, and 1.9x with step caching! We’ve also found that we can sparsify FP8 Flux to get a 1.1x end-to-end speedup over the fastest open-source implementation.
Slow-Changing, Sparse Activations
Chipmunk exploits two simple observations about diffusion transformers:
- Activations move slowly: In each step, a Diffusion Transformer (DiT) denoises a latent noise vector. This noise vector changes slowly across successive steps in the diffusion process, and so do the per-layer activations.
- Activations are sparse: In attention, it is common to see queries place a very large percentage of their attention probability mass on a small subset of keys, which means that the output will mostly be made up of the small subset of associated rows of $V$. And in MLP, previous works have observed significant sparsity in the intermediate activations of both ReLU- and GeLU-based layers, meaning that the output will mostly be made up of the top activated rows of $W_2$.
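One way to quantify the second observation is to measure how much of each query's softmax mass falls on its top few keys. The sketch below shows only the measurement. The helper names (`softmax`, `topk_mass`) and the random `q` and `k` are illustrative assumptions, and random inputs are far less peaked than a trained DiT's attention, so the printed number says nothing about real models.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def topk_mass(probs, num_keys):
    """Fraction of each query's probability mass on its num_keys largest keys."""
    largest = np.sort(probs, axis=-1)[:, -num_keys:]
    return largest.sum(axis=-1)

rng = np.random.default_rng(0)
n_tokens, head_dim = 256, 64

# Synthetic stand-ins for one attention head's queries and keys.
q = rng.standard_normal((n_tokens, head_dim))
k = rng.standard_normal((n_tokens, head_dim))

# Standard scaled dot-product attention probabilities.
probs = softmax(q @ k.T / np.sqrt(head_dim))
mass = topk_mass(probs, num_keys=32)
print(f"mean mass on top 32 of {n_tokens} keys: {mass.mean():.3f}")
```

Running the same measurement on attention probabilities captured from an actual model layer would show how concentrated they are in practice.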
Activation Deltas Across Diffusion Steps are Very Sparse
Chipmunk uses these two observations to reduce the compute cost of the diffusion model: we can capture nearly all the cross-step changes in the activations by recomputing only a small subset of attention and MLP. What does this mean, concretely? Let’s revisit the attention and MLP equations:
- Attention: $\text{softmax}(Q @ K^T) @ V$
- MLP: $\text{gelu}(x @ W_1) @ W_2$
Both operations use a non-linearity to compute the scalar coefficients for a linear combination of value vectors. In attention, the value vectors are dynamic ($V$ is projected from the current token representation). In MLP, the value vectors are static (rows of the weights $W_2$). Thus, in attention, our outputs are a sum of scaled rows in the $V$ matrix, and in MLP, our outputs are a sum of scaled rows in the $W_2$ matrix (the bias is one extra static vector). We can visualize these individual vectors as being summed to produce the total operation output.
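The "sum of scaled rows" view can be checked numerically for a single query and a single token. This is a minimal sketch on random data. The tanh approximation of GELU is an assumption (the post does not say which variant is used), and the attention follows the post's formula without the usual $1/\sqrt{d}$ scaling.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU (assumed variant, for illustration only)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(logits):
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

rng = np.random.default_rng(0)
n_keys, d_model, d_hidden = 8, 16, 32

# Attention for one query: softmax gives the coefficients, rows of V are the vectors.
q = rng.standard_normal(d_model)
K = rng.standard_normal((n_keys, d_model))
V = rng.standard_normal((n_keys, d_model))
attn_coeffs = softmax(q @ K.T)
attn_dense = attn_coeffs @ V
attn_row_sum = sum(c * V[j] for j, c in enumerate(attn_coeffs))

# MLP for one token: GELU gives the coefficients, rows of W2 are the vectors.
x = rng.standard_normal(d_model)
W1 = rng.standard_normal((d_model, d_hidden))
W2 = rng.standard_normal((d_hidden, d_model))
mlp_coeffs = gelu(x @ W1)
mlp_dense = mlp_coeffs @ W2
mlp_row_sum = sum(c * W2[j] for j, c in enumerate(mlp_coeffs))

print(np.allclose(attn_dense, attn_row_sum), np.allclose(mlp_dense, mlp_row_sum))
```

Because each output is a plain sum, any individual term can be swapped out without touching the others, which is what makes a cached, partially updated output possible.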
Chipmunk’s key insight is that the value vectors (the colored columns of $V$ above) change slowly, as do the scalar weights themselves (the colored values in the attention matrix above). Chipmunk caches the value vectors and the scalar weights, and dynamically chooses which ones to recompute in each step.
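To make the caching idea concrete, here is a minimal NumPy sketch of a sparse delta update for an MLP. It is not Chipmunk's kernel or its selection rule. The function names, the single index set shared by all tokens, and the "largest cached activation magnitude" heuristic are all assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU (assumed variant, for illustration only)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def dense_mlp(x, w1, w2):
    """Full forward pass; returns the activations and the output to cache."""
    acts = gelu(x @ w1)
    return acts, acts @ w2

def sparse_delta_mlp(x, w1, w2, cached_acts, cached_out, idx):
    """Recompute only hidden units in idx and patch the cached output."""
    new_acts = gelu(x @ w1[:, idx])                     # touches len(idx) columns of W1
    delta = (new_acts - cached_acts[:, idx]) @ w2[idx]  # touches len(idx) rows of W2
    acts = cached_acts.copy()
    acts[:, idx] = new_acts
    return acts, cached_out + delta

rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden, keep = 16, 32, 128, 32
w1 = rng.standard_normal((d_model, d_hidden))
w2 = rng.standard_normal((d_hidden, d_model))

# Step n-1 runs dense and fills the cache; step n sees a slightly changed input.
x_prev = rng.standard_normal((n_tokens, d_model))
x_next = x_prev + 0.05 * rng.standard_normal((n_tokens, d_model))
acts_prev, out_prev = dense_mlp(x_prev, w1, w2)

# Hypothetical selection rule: hidden units with the largest cached magnitude.
idx = np.argsort(np.abs(acts_prev).sum(axis=0))[-keep:]
acts_sparse, out_sparse = sparse_delta_mlp(x_next, w1, w2, acts_prev, out_prev, idx)

out_dense = dense_mlp(x_next, w1, w2)[1]
rel_err = np.linalg.norm(out_sparse - out_dense) / np.linalg.norm(out_dense)
print(f"recomputed {keep}/{d_hidden} hidden units, relative error {rel_err:.3f}")
```

When `idx` covers every hidden unit, the update reproduces the dense output exactly. With a partial `idx`, the result equals a dense pass over a mix of fresh and cached activations, so the error depends entirely on how much the skipped activations changed.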
Given an attention/MLP output cache, an equivalent definition of a normal dense forward pass on step $n$ is the following: Subtract all of step $n-1$’s output vectors from the cache, and add all of step $n$’s new vectors. Therefore, given the natural sparsity in intermediate matrices, we can reformulate attention and MLP to compute a delta based on the previous step’s outputs. That is, we replace a subset of the output vectors and reuse the rest from the previous step. The output vectors that we replace correspond to sparsifying keys/values at the granularity of a single token in the intermediate matrix.
Hardware-Efficient Sparsity Pattern
The sparsity…