RepoTogether AITogether AIpublished Apr 13, 2026seen 5d

togethercomputer/saw-int4

Shell

Open original ↗

Captured source

source ↗
published Apr 13, 2026seen 5dcaptured 11hhttp 200method plain

togethercomputer/saw-int4

Description: Official implementation of Paper "System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving"

Language: Shell

License: MIT

Stars: 27

Forks: 3

Open issues: 2

Created: 2026-04-13T20:42:59Z

Pushed: 2026-04-17T23:05:36Z

Default branch: main

Fork: no

Archived: no

README:

saw-int4

saw-int4 is the official implementation of >

This repository implements Block Diagonal Rotation (BDR) for KV-cache quantization, along with system-level optimizations that seamlessly integrate into SGLang. The resulting system achieves near-BF16 accuracy while preserving the end-to-end performance benefits of INT4.

Contents

  • [Introduction](#introduction)
  • [How to run BDR](#how-to-run-bdr)
  • [Get the code](#get-the-code)
  • [Server requirements](#server-requirements)
  • [Install BDR (sglang-fast-rotation)](#install-bdr-sglang-fast-rotation)
  • [Run BDR](#run-bdr)
  • [Quick demo (verify your install)](#quick-demo-verify-your-install)
  • [Primary accuracy and throughput](#primary-accuracy-and-throughput)
  • [Accuracy (primary)](#accuracy-primary)
  • [Prepare](#prepare)
  • [RUN-GPQA](#run-gpqa)
  • [Accuracy results (primary)](#accuracy-results-primary)
  • [Throughput and latency (primary)](#throughput-and-latency-primary)
  • [Prepare (genai-bench)](#prepare-genai-bench)
  • [Speed results (primary)](#speed-results-primary)
  • [Ablation study (k-means, k-means + rotation)](#ablation-study-k-means-k-means--rotation)
  • [Install sglang-kmeans](#install-sglang-kmeans)
  • [KV calibration (ablation only)](#kv-calibration-ablation-only)
  • [Ablation method matrix](#ablation-method-matrix)
  • [Accuracy results (ablation)](#accuracy-results-ablation)
  • [Repository layout](#repository-layout)
  • [Full reproduction](#full-reproduction)
  • [License](#license)

Introduction

This work studies 4-bit KV-cache quantization under real serving constraints such as paged memory layouts, regular memory access, and fused attention execution. Our primary method, BDR (block-diagonal rotation), applies a block-diagonal Hadamard rotation to the KV cache before token-wise INT4 KV-cache quantization, implemented directly inside a fork of [SGLang](https://github.com/sgl-project/sglang).

We ship two submodule branches on the same fork remote:

  • [third_party/sglang-fast-rotation](third_party/sglang-fast-rotation)Our proposed BDR implementation: fused block-diagonal rotation + INT4 KV-cache write. Use this fork for both accuracy and throughput on BF16, INT4, and BDR (the main paper numbers).
  • [third_party/sglang-kmeans](third_party/sglang-kmeans)Ablation study for kmeans, kmeans+rotation: KV dump, k-means centroids, and k-means + rotation variants. Not required to reproduce the core BDR vs BF16 vs INT4 story.

Pinned commits: [SUBMODULE_VERSIONS.md](SUBMODULE_VERSIONS.md).

How to run BDR

This section covers everything needed to run BDR on `third_party/sglang-fast-rotation`: get the code, install, and launch a server.

Get the code

git clone --recurse-submodules https://github.com/togethercomputer/saw-int4.git
cd saw-int4

If you cloned without submodules: git submodule update --init third_party/sglang-fast-rotation.

Server requirements

The BDR implementation is built on top of the SGLang codebase and currently assumes the following setup:

  • MHA models onlyMLA and other non-MHA layouts are not supported for these KV / BDR settings.
  • Prefill backend: `fa3`.
  • Decode backend: `triton`.

Install BDR

cd third_party/sglang-fast-rotation/python
pip install -e ".[all]"
pip install --no-build-isolation "git+https://github.com/Dao-AILab/fast-hadamard-transform.git"

Run BDR

BF16 KV (baseline)

python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-4B-Thinking-2507" \
--port 30000 \
--kv-cache-dtype auto

Original INT4 KV

python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-4B-Thinking-2507" \
--port 30000 \
--kv-cache-dtype int4

BDR (block diagnoal rotation on K)

HADAMARD=1 HADAMARD_ORDER=128 python -m sglang.launch_server \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--model-path "Qwen/Qwen3-4B-Thinking-2507" \
--port 30000 \
--kv-cache-dtype int4

For the full env variable reference, and the complete mode matrix, see [docs/bdr_env_vars.md](docs/bdr_env_vars.md).

Quick demo (verify your install)

With the server running in any of the three modes above, run the smoke-test script from the repository root:

pip install openai # if not already installed
python scripts/bdr_smoke_test.py --port 30001 --model Qwen/Qwen3-4B-Thinking-2507

The script sends a GPQA sample question to the server and streams the response.

Server : http://0.0.0.0:30000/v1
Model : Qwen/Qwen3-4B-Thinking-2507

--- Prompt (GPQA sample) ---
Answer the following multiple choice question.....
...

--- Response ---

Primary accuracy and throughput

Accuracy (simple-evals / GPQA) and throughput (genai-bench) both use `third_party/sglang-fast-rotation`; server setup is in [How to run BDR](#how-to-run-bdr). Accuracy model: `Qwen/Qwen3-4B-Thinking-2507`. Throughput model: `Qwen/Qwen3-8B` (override MODEL_PATH in scripts if you align checkpoints).

Accuracy (primary)

Prepare

Prerequisite (GPQA client): [openai/simple-evals](https://github.com/openai/simple-evals) is included as a submodule at `third_party/simple-evals`.

git submodule update --init --checkout third_party/simple-evals
cd third_party/simple-evals
mkdir -p simple_evals
touch simple_evals/__init__.py
pip install openai pandas requests jinja2 tqdm numpy

Add a local model alias once in third_party/simple-evals/simple_evals.py inside the models = { ... } dictionary so simple-evals and set max_tokens=32768:

"qwen3_4b": ChatCompletionSampler(
model="Qwen/Qwen3-4B-Thinking-2507",
system_message=OPENAI_SYSTEM_MESSAGE_API,
max_tokens=32768,
),

RUN-GPQA

With simple-evals installed and the SGLang server already up (start it in the desired mode from [Run BDR](#run-bdr), using `Qwen/Qwen3-4B-Thinking-2507` as the model), point the…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine repo with very low stars