togethercomputer/saw-int4
Shell
Captured source
source ↗togethercomputer/saw-int4
Description: Official implementation of Paper "System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving"
Language: Shell
License: MIT
Stars: 27
Forks: 3
Open issues: 2
Created: 2026-04-13T20:42:59Z
Pushed: 2026-04-17T23:05:36Z
Default branch: main
Fork: no
Archived: no
README:
saw-int4
saw-int4 is the official implementation of >
This repository implements Block Diagonal Rotation (BDR) for KV-cache quantization, along with system-level optimizations that seamlessly integrate into SGLang. The resulting system achieves near-BF16 accuracy while preserving the end-to-end performance benefits of INT4.
Contents
- [Introduction](#introduction)
- [How to run BDR](#how-to-run-bdr)
- [Get the code](#get-the-code)
- [Server requirements](#server-requirements)
- [Install BDR (sglang-fast-rotation)](#install-bdr-sglang-fast-rotation)
- [Run BDR](#run-bdr)
- [Quick demo (verify your install)](#quick-demo-verify-your-install)
- [Primary accuracy and throughput](#primary-accuracy-and-throughput)
- [Accuracy (primary)](#accuracy-primary)
- [Prepare](#prepare)
- [RUN-GPQA](#run-gpqa)
- [Accuracy results (primary)](#accuracy-results-primary)
- [Throughput and latency (primary)](#throughput-and-latency-primary)
- [Prepare (genai-bench)](#prepare-genai-bench)
- [Speed results (primary)](#speed-results-primary)
- [Ablation study (k-means, k-means + rotation)](#ablation-study-k-means-k-means--rotation)
- [Install sglang-kmeans](#install-sglang-kmeans)
- [KV calibration (ablation only)](#kv-calibration-ablation-only)
- [Ablation method matrix](#ablation-method-matrix)
- [Accuracy results (ablation)](#accuracy-results-ablation)
- [Repository layout](#repository-layout)
- [Full reproduction](#full-reproduction)
- [License](#license)
Introduction
This work studies 4-bit KV-cache quantization under real serving constraints such as paged memory layouts, regular memory access, and fused attention execution. Our primary method, BDR (block-diagonal rotation), applies a block-diagonal Hadamard rotation to the KV cache before token-wise INT4 KV-cache quantization, implemented directly inside a fork of [SGLang](https://github.com/sgl-project/sglang).
We ship two submodule branches on the same fork remote:
- [third_party/sglang-fast-rotation](third_party/sglang-fast-rotation) — Our proposed BDR implementation: fused block-diagonal rotation + INT4 KV-cache write. Use this fork for both accuracy and throughput on BF16, INT4, and BDR (the main paper numbers).
- [third_party/sglang-kmeans](third_party/sglang-kmeans) — Ablation study for kmeans, kmeans+rotation: KV dump, k-means centroids, and k-means + rotation variants. Not required to reproduce the core BDR vs BF16 vs INT4 story.
Pinned commits: [SUBMODULE_VERSIONS.md](SUBMODULE_VERSIONS.md).
How to run BDR
This section covers everything needed to run BDR on `third_party/sglang-fast-rotation`: get the code, install, and launch a server.
Get the code
git clone --recurse-submodules https://github.com/togethercomputer/saw-int4.git cd saw-int4
If you cloned without submodules: git submodule update --init third_party/sglang-fast-rotation.
Server requirements
The BDR implementation is built on top of the SGLang codebase and currently assumes the following setup:
- MHA models only — MLA and other non-MHA layouts are not supported for these KV / BDR settings.
- Prefill backend: `fa3`.
- Decode backend: `triton`.
Install BDR
cd third_party/sglang-fast-rotation/python pip install -e ".[all]" pip install --no-build-isolation "git+https://github.com/Dao-AILab/fast-hadamard-transform.git"
Run BDR
BF16 KV (baseline)
python -m sglang.launch_server \ --prefill-attention-backend fa3 \ --decode-attention-backend triton \ --model-path "Qwen/Qwen3-4B-Thinking-2507" \ --port 30000 \ --kv-cache-dtype auto
Original INT4 KV
python -m sglang.launch_server \ --prefill-attention-backend fa3 \ --decode-attention-backend triton \ --model-path "Qwen/Qwen3-4B-Thinking-2507" \ --port 30000 \ --kv-cache-dtype int4
BDR (block diagnoal rotation on K)
HADAMARD=1 HADAMARD_ORDER=128 python -m sglang.launch_server \ --prefill-attention-backend fa3 \ --decode-attention-backend triton \ --model-path "Qwen/Qwen3-4B-Thinking-2507" \ --port 30000 \ --kv-cache-dtype int4
For the full env variable reference, and the complete mode matrix, see [docs/bdr_env_vars.md](docs/bdr_env_vars.md).
Quick demo (verify your install)
With the server running in any of the three modes above, run the smoke-test script from the repository root:
pip install openai # if not already installed python scripts/bdr_smoke_test.py --port 30001 --model Qwen/Qwen3-4B-Thinking-2507
The script sends a GPQA sample question to the server and streams the response.
Server : http://0.0.0.0:30000/v1 Model : Qwen/Qwen3-4B-Thinking-2507 --- Prompt (GPQA sample) --- Answer the following multiple choice question..... ... --- Response ---
Primary accuracy and throughput
Accuracy (simple-evals / GPQA) and throughput (genai-bench) both use `third_party/sglang-fast-rotation`; server setup is in [How to run BDR](#how-to-run-bdr). Accuracy model: `Qwen/Qwen3-4B-Thinking-2507`. Throughput model: `Qwen/Qwen3-8B` (override MODEL_PATH in scripts if you align checkpoints).
Accuracy (primary)
Prepare
Prerequisite (GPQA client): [openai/simple-evals](https://github.com/openai/simple-evals) is included as a submodule at `third_party/simple-evals`.
git submodule update --init --checkout third_party/simple-evals cd third_party/simple-evals mkdir -p simple_evals touch simple_evals/__init__.py pip install openai pandas requests jinja2 tqdm numpy
Add a local model alias once in third_party/simple-evals/simple_evals.py inside the models = { ... } dictionary so simple-evals and set max_tokens=32768:
"qwen3_4b": ChatCompletionSampler( model="Qwen/Qwen3-4B-Thinking-2507", system_message=OPENAI_SYSTEM_MESSAGE_API, max_tokens=32768, ),
RUN-GPQA
With simple-evals installed and the SGLang server already up (start it in the desired mode from [Run BDR](#run-bdr), using `Qwen/Qwen3-4B-Thinking-2507` as the model), point the…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10Routine repo with very low stars