Repo · NVIDIA · published Nov 6, 2024

NVIDIA/kvpress

Python

NVIDIA/kvpress

Description: LLM KV cache compression made easy

Language: Python

License: Apache-2.0

Stars: 1108

Forks: 150

Open issues: 4

Created: 2024-11-06T19:23:20Z

Pushed: 2026-06-10T10:03:06Z

Default branch: main

Fork: no

Archived: no

README: ![PyPI version](https://badge.fury.io/py/kvpress) ![Colab example notebook](https://colab.research.google.com/drive/1JNvaTKuuAHrl49dYB9-mdEH_y52Ib-NP?usp=drive_link)

![kvpress](kvpress.jpg)

Deploying long-context LLMs is costly due to the linear growth of the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory. kvpress implements multiple KV cache compression methods and benchmarks using 🤗 transformers, aiming to simplify the development of new methods for researchers and developers in this field.
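
The 330GB figure can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes the publicly documented Llama 3.1-70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128), 2 bytes per float16 value, and a batch size of 1. The function name is illustrative and not part of kvpress.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int,
) -> int:
    """Size of the KV cache for one sequence, in bytes.

    Each token stores one key and one value vector (hence the factor 2)
    per KV head and per layer, so the size grows linearly with num_tokens.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed Llama 3.1-70B configuration (not stated in this README).
LLAMA_3_1_70B = {"num_layers": 80, "num_kv_heads": 8, "head_dim": 128}

total = kv_cache_bytes(1_000_000, bytes_per_value=2, **LLAMA_3_1_70B)
print(f"{total / 1e9:.1f} GB")  # 327.7 GB, i.e. roughly 330GB
```

Under the same assumptions, pruning half of the cached key-value pairs would halve this figure.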

Installation

pip install kvpress

For a local installation, use uv:

git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync

To install with all optional dependencies, run:

git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync --extra eval --extra flash-attn

Usage

KVPress provides a set of "presses" that compress the KV cache during the prefilling phase. Each press has a compression_ratio attribute that measures how much of the cache is removed. The easiest way to use a press is through our custom KVPressTextGenerationPipeline. It is automatically registered as a transformers pipeline under the name "kv-press-text-generation" when kvpress is imported, and it handles chat templates and tokenization for you:

from transformers import pipeline
from kvpress import ExpectedAttentionPress

model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")

context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context" # optional

press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]

In the snippet above, compression is applied only to the context tokens, so you can evaluate the same compressed context against different questions. Check the [Wikipedia notebook demo](notebooks/wikipedia_demo.ipynb) for a more detailed example (also available on Colab).

Decoding Compression

By default, KVPress applies compression during the prefilling phase. As a new (experimental) feature, we now support decoding compression via the DecodingPress wrapper. DecodingPress compresses the KV cache periodically during token generation, optionally maintaining a buffer of recent hidden states. DecodingPress supports the following parameters:

  • base_press: Any ScorerPress (e.g., KnormPress, CriticalKVPress)
  • compression_interval: Steps between compressions (default: 10)
  • target_size: Target size of the cache after compression (default: 1024)
  • hidden_states_buffer_size: Number of hidden states to buffer before compression (default: 128). Some presses don't need buffered hidden states and can set this to 0.

Unlike prefilling presses, which take a compression ratio, DecodingPress takes a target_size. The cache is compressed every compression_interval steps, and the compression ratio is computed automatically so that the size of the cache after compression equals target_size.
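
To make the target_size behaviour concrete, here is a toy simulation of that schedule. It is not the library's implementation: the function names are hypothetical, only token counts are tracked, and it assumes that each decoding step adds one token to the cache and that compression is skipped when the cache is already at or below target_size. The default values mirror those listed above.

```python
from typing import List, Tuple


def implied_compression_ratio(cache_size: int, target_size: int) -> float:
    """Ratio that shrinks cache_size down to target_size (0.0 if already small enough)."""
    if cache_size <= target_size:
        return 0.0
    return 1.0 - target_size / cache_size


def simulate_decoding(
    initial_cache_size: int,
    num_steps: int,
    compression_interval: int = 10,
    target_size: int = 1024,
) -> Tuple[int, List[Tuple[int, int, float]]]:
    """Return the final cache size and a log of (step, size_before, ratio) events."""
    cache_size = initial_cache_size
    events = []
    for step in range(1, num_steps + 1):
        cache_size += 1  # one new key-value pair per generated token
        if step % compression_interval == 0 and cache_size > target_size:
            ratio = implied_compression_ratio(cache_size, target_size)
            events.append((step, cache_size, ratio))
            cache_size = target_size  # cache is pruned down to the target
    return cache_size, events


final_size, events = simulate_decoding(initial_cache_size=1024, num_steps=25)
for step, size_before, ratio in events:
    print(f"step {step}: {size_before} -> 1024 tokens (ratio {ratio:.4f})")
```

In this toy model the cache never exceeds target_size + compression_interval tokens once the first compression has happened, which is the practical benefit of specifying a target size rather than a ratio.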

An example for decoding compression:

from transformers import pipeline
from kvpress import KnormPress
from kvpress import DecodingPress

# Initialize the pipeline
device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)

# Create a decoding press that compresses the cache to 512 tokens every 10 steps
decoding_press = DecodingPress(
    base_press=KnormPress(),
    compression_interval=10,
    target_size=512,
)

# Use with pipeline
context = "A very long text you want to compress during generation"
question = "Tell me a long story about this context"
response = pipe(context, question=question, press=decoding_press)["answer"]

> Not all existing presses are fully compatible with DecodingPress due to fundamental differences in how compression works during decoding versus prefilling. In particular, only ScorerPress subclasses are supported as base presses.

Available presses

All current presses are training-free and inherit from BasePress ([source](kvpress/presses/base_press.py)).

Several presses inherit from ScorerPress ([source](kvpress/presses/scorer_press.py)) and rely on a score to prune the KV pairs with lowest importance:

  • RandomPress ([source](kvpress/presses/random_press.py)): random score
  • KnormPress ([source](kvpress/presses/knorm_press.py), paper): inverse norm of the key
  • SnapKVPress ([source](kvpress/presses/snapkv_press.py), paper): average attention weight of the last queries
  • ExpectedAttentionPress ([source](kvpress/presses/expected_attention_press.py), [notebook](notebooks/expected_attention.ipynb)): expected attention weight during the generation phase
  • StreamingLLMPress ([source](kvpress/presses/streaming_llm_press.py), paper): keep only the initial and recent tokens
  • TOVAPress ([source](kvpress/presses/tova_press.py), paper): attention weight of the last query averaged across heads
  • ObservedAttentionPress ([source](kvpress/presses/observed_attention_press.py), paper): average attention weight observed during the prefilling phase
  • QFilterPress ([source](kvpress/presses/qfilter_press.py), paper): project the key representations onto the main SVD component of the query vectors to approximate the attention scores
  • PyramidKVPress ([source](kvpress/presses/pyramidkv_press.py), paper): maintain pyramid-like cache sizes, allocating more cache budget to lower layers and less to higher layers
  • LagKVPress ([source](kvpress/presses/lagkv_press.py),…
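
As an illustration of score-based pruning in general, the sketch below scores the cached keys of a single attention head and keeps only the highest-scoring key-value pairs. It is a simplified stand-in, not the code in scorer_press.py: the function names are hypothetical, plain Python lists replace tensors, the negative L2 norm is used as the "inverse norm" score (it ranks keys the same way), and rounding the number of kept pairs down with int() is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def inverse_key_norm_scores(keys: Sequence[Sequence[float]]) -> List[float]:
    """Score each key by its negated L2 norm: a smaller norm gives a higher score."""
    return [-math.sqrt(sum(x * x for x in key)) for key in keys]


def prune_by_score(
    keys: Sequence, values: Sequence, scores: Sequence[float], compression_ratio: float
) -> Tuple[list, list, List[int]]:
    """Keep the highest-scoring KV pairs, preserving their original order."""
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    n_kept = int(len(keys) * (1 - compression_ratio))  # assumed rounding rule
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_kept])  # restore positional order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


keys = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]]  # norms: 5, 0.1, 1, 2
values = ["v0", "v1", "v2", "v3"]
scores = inverse_key_norm_scores(keys)
_, kept_values, kept_idx = prune_by_score(keys, values, scores, compression_ratio=0.5)
print(kept_idx, kept_values)  # [1, 2] ['v1', 'v2']
```

Only the scoring function changes between presses of this kind. Swapping inverse_key_norm_scores for a random score, for example, would give the behaviour described for RandomPress.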
