Consistency diffusion language models: Up to 14x faster inference without sacrificing quality
Published 2/19/2026
Authors
Minseo Kim, Chenfeng Xu, Coleman Richard Charles Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami | Seoul National University, University of California, Berkeley, Together AI
Summary
We introduce consistency diffusion language models (CDLM), which accelerate diffusion language model inference by combining consistency-based multi-token finalization with block-wise KV caching, achieving up to 14.5x latency speedups on math and coding tasks.
Diffusion Language Models (DLMs) are emerging as a promising alternative to autoregressive (AR) LMs. Instead of generating one token at a time, DLMs iteratively refine a partially masked sequence over multiple sampling steps, gradually transforming a fully masked sequence into clean text. This refinement process creates a compelling opportunity: it enables parallel generation, allowing the model to finalize multiple tokens per iteration and potentially achieve higher throughput than AR decoding. At the same time, it can exploit bidirectional context to unlock new capabilities such as text infilling and refinement.
Figure: Visualization of inference in CDLM, naive DLMs, and autoregressive (AR) models.
However, in practice, standard DLMs suffer from two major inefficiencies. [1]
KV caching incompatibility under full bidirectional attention. Standard DLMs commonly use bidirectional (non-causal) attention, which requires recomputing attention over the full context at every denoising step, making inference expensive and preventing standard KV caching.
High refinement step counts to maintain quality. High-quality generation typically requires many denoising/refinement steps, often comparable to the generation length. Naively reducing the number of steps tends to degrade quality sharply.
CDLM targets both bottlenecks through a post-training recipe that makes fewer-step inference reliable while enabling exact block-wise KV caching.
Preliminary: Inference in diffusion language models
DLM generation is an iterative refinement over N discrete sampling steps. It transforms a fully masked sequence at time t=1 into a clean sequence at t=0. At each step, the model predicts a clean sequence distribution x_0 given the current noisy sequence x_t and prompt c:
$p_{\theta}(\mathbf{x}_0 \mid \mathbf{x}_t, c)$
A common deterministic instantiation is low-confidence remasking: the model greedily unmasks tokens (often within blocks), finalizing the highest-confidence masked positions while keeping others masked. This leads to the decoding trajectory:
$\mathcal{T}_{\mathbf{x}} = \left(\mathbf{x}_{t_0}, \mathbf{x}_{t_1}, \ldots, \mathbf{x}_{t_N}\right), \quad t_k = 1 - \frac{k}{N}$
which records how the partially refined sequence evolves step-by-step. This trajectory becomes the core object for CDLM’s training.
CDLM training
1) Trajectory collection
We collect trajectories offline by running inference with a DLM on domain-specific prompts. For each prompt x, we record the token-level decoding trajectory T_x, a compact hidden-state buffer H_x containing last-layer hidden states at token finalization moments, and the ground-truth text ŷ. Concretely, we adopt block-wise decoding with a generation length L_g = 256, block size B = 32, and a total of N = L_g steps (i.e., finalizing exactly one token per step within the current block). This conservative setting yields higher-quality trajectories for distillation.
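
The conservative collection setting above (one token finalized per step, block by block) can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the teacher DLM, the `MASK` sentinel and function names are invented for this example, and the hidden-state buffer H_x is not modeled.

```python
import random

MASK = None  # hypothetical sentinel for a still-masked position


def diffusion_times(num_steps):
    """Time grid t_k = 1 - k/N for k = 0..N (t=1 fully masked, t=0 clean)."""
    return [1 - k / num_steps for k in range(num_steps + 1)]


def toy_predict(seq):
    """Hypothetical stand-in for a DLM: one (token, confidence) pair per position."""
    rng = random.Random(sum(tok is not MASK for tok in seq))
    return [(i % 7, rng.random()) for i in range(len(seq))]


def collect_trajectory(predict, gen_len=256, block_size=32):
    """Low-confidence remasking, finalizing exactly one token per step.

    Returns the list of N + 1 sequence states, where N = gen_len.
    """
    seq = [MASK] * gen_len
    trajectory = [list(seq)]  # state at t = 1: everything masked
    for start in range(0, gen_len, block_size):
        block = range(start, min(start + block_size, gen_len))
        for _ in block:
            preds = predict(seq)
            masked = [i for i in block if seq[i] is MASK]
            # Finalize only the most confident masked position in this block.
            best = max(masked, key=lambda i: preds[i][1])
            seq[best] = preds[best][0]
            trajectory.append(list(seq))
    return trajectory


trajectory = collect_trajectory(toy_predict, gen_len=8, block_size=4)
times = diffusion_times(len(trajectory) - 1)
```

Each recorded state pairs with one entry of `times`, so the k-th state corresponds to diffusion time t_k.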
Figure: Left: Teacher DLM with full bidirectional attention. Right: Student DLM with a block-wise causal mask.
2) Block-causal student and attention mask
During trajectory extraction, we use a full bidirectional attention mask. In contrast, when training CDLM, we employ a block-wise causal mask under which each position attends to the prompt, previously completed blocks, and the current decoding block. This design converts the model from a fully bidirectional DLM into a block-diffusion model (like [2]), enabling exact block-wise KV caching for finalized blocks.
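
A block-wise causal mask of the kind described above can be built as follows. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, and it assumes that prompt tokens attend only to the prompt, which the text does not state explicitly.

```python
def block_causal_mask(prompt_len, gen_len, block_size):
    """Return mask[q][k] == True when query position q may attend to key position k.

    The prompt is treated as block 0; generated tokens are grouped into
    consecutive blocks 1, 2, ... of `block_size` positions each.
    """

    def block_id(pos):
        if pos < prompt_len:
            return 0
        return 1 + (pos - prompt_len) // block_size

    total = prompt_len + gen_len
    # A query sees every key in its own block or in any earlier block.
    return [
        [block_id(k) <= block_id(q) for k in range(total)]
        for q in range(total)
    ]


# Example: 2 prompt tokens, 4 generated tokens, blocks of size 2.
mask = block_causal_mask(prompt_len=2, gen_len=4, block_size=2)
```

Because no position ever attends to a later block, the keys and values of a finalized block do not change when later blocks are decoded, which is what makes caching them exact.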
Figure: Left: Block-wise decoding trajectory of the teacher (steps 0 → N; diffusion time t: 1 → 0). Right: The student’s three-objective loss at an intermediate state y.
3) Training objectives
CDLM jointly minimizes three objectives:
(i) Distillation loss (newly unmasked positions)
For positions that become newly unmasked between an intermediate state y and its block completion y*, we match the student’s predictive distribution to the teacher’s reconstructed distribution obtained from stored hidden states. Intuition: this objective serves as the primary anchor that teaches the student to finalize multiple tokens within a block under block-causal constraints.
(ii) Consistency loss (still-masked positions)
We enforce within-block temporal consistency by aligning the student’s predictions at state y with its own predictions at the more informed state y* for still-masked positions, using a stop-gradient target. Intuition: this objective encourages stable multi-step transitions along the decoding trajectory.
(iii) Auxiliary DLM masked-denoising loss
We include a standard masked denoising objective applied to randomly masked ground-truth text. Intuition: this objective preserves the model’s general masked-token prediction capability and helps retain reasoning behavior, particularly on mathematical tasks.
4) Inference
At inference time, CDLM decodes in a block-wise autoregressive manner, reusing the KV cache for the prompt and all previously finalized blocks. Within each block, we apply confidence-thresholded parallel finalization. [3] We also adopt early stopping once an end-of-text token appears in the current block. We intentionally avoid additional heuristics that introduce extra hyperparameters (e.g., inter-block parallelism with…
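
The block-wise inference loop from step 4 can be sketched as below. Several details are assumptions made for illustration rather than facts from the post: the threshold value 0.9, the fallback that finalizes the single most confident token when nothing clears the threshold, stopping after the block containing the end-of-text token is complete, and the two toy predictors. KV caching is not modeled here; in a real model the prompt and finished blocks would be served from the cache.

```python
MASK = None    # hypothetical sentinel for a still-masked position
EOS = "<eos>"  # hypothetical end-of-text token


def decode(predict, gen_len, block_size, threshold=0.9):
    """Block-wise decoding with confidence-thresholded parallel finalization.

    `predict(seq)` must return one (token, confidence) pair per position.
    Returns the decoded sequence and the number of refinement steps used.
    """
    seq = [MASK] * gen_len
    steps = 0
    for start in range(0, gen_len, block_size):
        block = range(start, min(start + block_size, gen_len))
        while any(seq[i] is MASK for i in block):
            preds = predict(seq)
            masked = [i for i in block if seq[i] is MASK]
            # Finalize every masked position whose confidence clears the threshold.
            chosen = [i for i in masked if preds[i][1] >= threshold]
            if not chosen:
                # Assumed fallback: always make progress with the best candidate.
                chosen = [max(masked, key=lambda i: preds[i][1])]
            for i in chosen:
                seq[i] = preds[i][0]
            steps += 1
        # Early stopping: skip later blocks once this block contains EOS.
        if any(seq[i] == EOS for i in block):
            break
    return seq, steps


def confident_predict(seq):
    """Toy predictor: confident everywhere, predicts EOS at position 5."""
    return [(EOS if i == 5 else "tok", 0.95) for i in range(len(seq))]


def hesitant_predict(seq):
    """Toy predictor: never clears the threshold, so one token per step."""
    return [("tok", 0.1 + 0.01 * i) for i in range(len(seq))]


fast_seq, fast_steps = decode(confident_predict, gen_len=12, block_size=4)
slow_seq, slow_steps = decode(hesitant_predict, gen_len=4, block_size=4)
```

With the confident predictor each block finishes in a single step, while the hesitant predictor degrades to one token per step, which is the regime the training recipe is meant to move the model away from.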