Model-Preserving Adaptive Rounding with YAQA
Captured source
source ↗Model-Preserving Adaptive Rounding with YAQA
⚡️ FlashAttention-4: up to 1.3× faster than cuDNN on NVIDIA Blackwell →
Introducing Together AI's new look →
🔎 ATLAS: runtime-learning accelerators delivering up to 4x faster LLM inference →
⚡ Together GPU Clusters: self-service NVIDIA GPUs, now generally available →
📦 Batch Inference API: Process billions of tokens at 50% lower cost for most models →
🪛 Fine-Tuning Platform Upgrades: Larger Models, Longer Contexts →
All blog posts
Research
Published 6/5/2025
Model-Preserving Adaptive Rounding with YAQA
Authors
Albert Tseng, Zhaofeng Sun, and Chris De Sa
Table of contents
40+ Models Chosen for Production...40+ Models Chosen for Production...40+ Models Chosen for Production...
Links in this article
Paper GitHub repo Precomputed Hessians Prequantized models
We’re excited to announce YAQA (Yet Another Quantization Algorithm), a new weight-only LLM post-training quantization method that quantizes models to directly preserve the original model’s outputs. YAQA is quantizer-agnostic, meaning that it can be used with both hardware-accelerated datatypes and special memory-bound quantizers like QTIP . Across a wide range of models and quantizers, YAQA consistently reduces the KL divergence to the original model by >30% over existing rounding algorithms, resulting in state of the art performance on downstream tasks. Post-Training Quantization The main goal of post-training quantization is to produce a smaller model whose outputs are as close as possible to the original model’s. Formally, given some high-precision model parameters $\theta^*$ and a set of representable low-precision points $C$, we want to minimize the distance between the original model outputs $M(\theta^*, X)$ and quantized model $M(\theta, X)$ over some inputs $X$ sampled from an input distribution $\mathcal{D}$ and model architecture $M$: $$\hat \theta \gets \text{argmin}_{\theta \in C} \mathbb{E}_{X \sim \mathcal{D}} D_{\text{KL}}(M(\theta^*, X) \| M(\theta, X))$$ Unfortunately, exactly solving $\hat \theta$ is intractable. Most existing quantization algorithms such as LDLQ, GPTQ, and AWQ instead try to solve a proxy layerwise minimization problem by independently minimizing the immediate activation error for each linear layer in $M$. For a linear layer $y=xW^T$, this can be written as $$\text{argmin}_{W \in C} \mathbb{E}_{x\in\mathbb{R}^n \sim \mathcal{D}}\|x(W^* - W)^T\|_F^2 = \text{argmin}_{W \in C} \mathbb{E}_{x \sim \mathcal{D}} \ \text{tr}((W^* - W)x^Tx(W^* - W)^T).$$ Although this objective works well in practice, it does not consider the effect of quantization on layers after $W$ and vice versa. As such, minimizing it does not necessarily reduce the KL to the original model. A Better Proxy Error Instead of minimizing the immediate layerwise activation error, YAQA directly minimizes the KL by using a near-optimal Kronecker-factored approximation of each linear layer’s Hessian with respect to the KL. If this seems complicated, don’t worry, it’s not nearly as bad as it sounds. Consider the second order expansion of the KL minimization problem for a linear layer $W \in \mathbb{R}^{m \times n}$: $$\text{argmin}_{W \in C} \mathbb{E}_{X \sim \mathcal{D}} D_{\text{KL}}(M(\Theta^*, W^*, X) \| M(\Theta^*, W, X)) \approx \frac{1}{2} (W - W^*)^T (\nabla_{W^*}^2 L) (W - W^*)$$ Here, $L$ is the KL divergence to the original model, so we can ignore first order terms since $\nabla_{W^*} L = 0$ (the KL of a distribution to itself is 0). This leaves $H = \nabla_{W^*}^2 L$, or the full Hessian of the linear layer. $H \in \mathbb{R}^{mn \times mn}$ has $O(10^{12})$ elements, which makes it too large to manifest. However, due to some nice properties of the KL divergence, we can tractably compute Hessian-vector products with $H$, which lets us find low-rank approximations of $H$ that we can subsequently use for quantization. YAQA makes use of two well known facts from linear algebra to quantize with $H$. First, the Hessian of the $KL$ is the Fisher Information Matrix (FIM). The FIM is given by $\mathbb{E}[\text{vec}(\nabla_{W^*} \ell) \text{vec}(\nabla_{W^*} \ell)^T]$, where $\nabla_{W^*} \ell$ can be obtained by backpropping through a specially constructed loss. The important thing here is that individual samples of the FIM’s expectation can be computed with the cost of a single backprop, which means we can tractably compute Hessian-vector products with the FIM. Second, we utilize the fact that the Kronecker product of two matrices $A \otimes B$ is equivalent to a reshaped rank-1 product. This means that we can find the optimal Kronecker-factored approximation of $H \approx H_O \otimes H_I$ by finding the optimal rank-1 approximation of $H$, which can be done with power iteration and the aforementioned Hessian-vector products. Finding $H_O$ and $H_I$ We wish to find $H_O$ and $H_I$ such that $H \approx H_O \otimes H_I$. Since $H_O \otimes H_I$ is a reshaped rank-1 matrix, we can find the optimal $H_O, H_I$ that minimizes $$\|H - H_O \otimes H_I\|_F^2$$ with power iteration. Power iteration computes the following updates $$H_O \gets \frac{HH_I}{\|H_I\|_F^2} \hspace{2cm} H_I \gets \frac{H^T H_O}{\|H_O\|_F^2}$$ Where $H$ is reshaped from $(mn, mn) \to (m^2, n^2)$. The convergence rate of power iteration is a function of the spectral gap of $H$, which is empirically high for LLMs. In YAQA, we present two different Hessian sketches for $H_O$ and $H_I$. Our first sketch (sketch A in the paper) uses a biased, low variance estimate of $H$ for power iteration. Our second sketch (B in the paper), which we describe below, uses an unbiased but higher variance estimate of $H$ for power iteration. B generally produces better models since it is unbiased, at the cost of being more expensive than A. Recall from before that $H$ is given by the Fisher Information Matrix $$H = \mathbb{E}[\text{vec}(\nabla_{W^*} \ell) \text{vec}(\nabla_{W^*} \ell)^T],$$ where $\ell$ is a specially constructed loss (see paper for details) and the expectation is taken over independent samples. Since we are operating in LLM-land, tokens within a sequence are not independent due to attention. This means that we must calculate the expectation over entire sequences instead of tokens, which reduces the achievable sample size for a fixed compute budget. While this isn’t catastrophic, it does limit us to either running fewer rounds of power iteration with more samples or using fewer samples and doing more…
Excerpt shown — open the source for the full document.
Notability
notability 6.0/10Solid research post from notable lab