NVIDIA/kvpress
Description: LLM KV cache compression made easy
Language: Python
License: Apache-2.0
Stars: 1108
Forks: 150
Open issues: 4
Created: 2024-11-06T19:23:20Z
Pushed: 2026-06-10T10:03:06Z
Default branch: main
Fork: no
Archived: no
README:  

Deploying long-context LLMs is costly due to the linear growth of the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory. kvpress implements multiple KV cache compression methods and benchmarks using 🤗 transformers, aiming to simplify the development of new methods for researchers and developers in this field.
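The 330GB figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes the published Llama 3.1-70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128), which the README does not state, and 2 bytes per float16 value. `kv_cache_bytes` is a hypothetical helper, not part of kvpress.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache in bytes for a single sequence."""
    # Factor 2: one key vector and one value vector per token, layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed Llama 3.1-70B configuration (not given in the README).
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **LLAMA_70B)
total = kv_cache_bytes(1_000_000, **LLAMA_70B)
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total / 1e9:.1f} GB for 1M tokens")
```

Under these assumptions the cache costs 320 KiB per token and roughly 328 GB for 1M tokens, consistent with the "up to 330GB" estimate. The size is linear in `num_tokens`, which is why removing KV pairs translates directly into memory savings.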
Installation
pip install kvpress
For a local installation, use uv:
git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync
To install with all optional dependencies, run:
git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync --extra eval --extra flash-attn
Usage
KVPress provides a set of "presses" that compress the KV cache during the prefilling phase. Each press has a compression_ratio attribute that measures the compression of the cache. The easiest way to use a press is through our custom KVPressTextGenerationPipeline. It is automatically registered as a transformers pipeline under the name "kv-press-text-generation" when kvpress is imported, and it handles chat templates and tokenization for you:
from transformers import pipeline
from kvpress import ExpectedAttentionPress
model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")
context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context" # optional
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]

In the snippet above, compression is applied only to the context tokens, so you can evaluate the same compression against different questions. Check the [Wikipedia notebook demo](notebooks/wikipedia_demo.ipynb) for a more detailed example (also available on Colab here).
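The same call pattern can be repeated to compare answers across compression ratios for a given question. The following is a minimal sketch: `answers_by_ratio` is a hypothetical helper, not part of kvpress. It assumes only the call signature shown above (`pipe(context, question=..., press=...)["answer"]`) and that the press class accepts a `compression_ratio` keyword.

```python
def answers_by_ratio(pipe, press_cls, context, question, ratios):
    """Run the same context and question through `pipe` once per compression ratio."""
    results = {}
    for ratio in ratios:
        # Build a fresh press for each ratio, e.g. press_cls=ExpectedAttentionPress.
        press = press_cls(compression_ratio=ratio)
        results[ratio] = pipe(context, question=question, press=press)["answer"]
    return results
```

With the pipeline from the snippet above, a call such as `answers_by_ratio(pipe, ExpectedAttentionPress, context, question, [0.25, 0.5, 0.75])` would return one answer per ratio, which makes it easy to see where quality starts to degrade.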
Decoding Compression
By default, KVPress applies compression during the prefilling phase. As a new (experimental) feature, we now support decoding compression via the DecodingPress wrapper. DecodingPress compresses the KV cache periodically during token generation, optionally maintaining a buffer of recent hidden states. DecodingPress supports the following parameters:
- base_press: Any ScorerPress (e.g., KnormPress, CriticalKVPress)
- compression_interval: Number of steps between compressions (default: 10)
- target_size: Target size of the cache after compression (default: 1024)
- hidden_states_buffer_size: Number of hidden states to buffer before compression (default: 128). Some presses don't need buffered hidden states and can set this to 0.
Instead of a compression ratio, DecodingPress uses target_size to control compression. The cache is compressed every compression_interval steps, and the compression ratio is computed automatically so that the cache size after compression equals target_size.
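To make the relationship between target_size and the compression ratio concrete, here is a small simulation in plain Python. It is a sketch of the behaviour described above, not the library's implementation: `implied_compression_ratio` and `simulate_cache_sizes` are hypothetical helpers. Two assumptions are made here and are not stated in the README: that the ratio is the fraction of KV pairs removed, and that compression triggers exactly every compression_interval generated tokens.

```python
def implied_compression_ratio(cache_size, target_size):
    """Fraction of KV pairs to drop so that `target_size` entries remain."""
    if cache_size <= target_size:
        return 0.0  # cache already fits, nothing to prune
    return 1.0 - target_size / cache_size


def simulate_cache_sizes(prefill_size, num_steps, compression_interval=10, target_size=1024):
    """Track the cache size over `num_steps` decoding steps (one new token per step)."""
    size = prefill_size
    sizes = []
    for step in range(1, num_steps + 1):
        size += 1  # each generated token appends one KV pair
        if step % compression_interval == 0 and size > target_size:
            size = target_size  # periodic compression down to the target
        sizes.append(size)
    return sizes


sizes = simulate_cache_sizes(prefill_size=1020, num_steps=30)
print(max(sizes), sizes[-1])
print(round(implied_compression_ratio(1034, 1024), 4))
```

In this model the cache size oscillates between target_size and roughly target_size + compression_interval, so memory use during generation stays bounded regardless of how many tokens are produced.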
An example of decoding compression:
from transformers import pipeline
from kvpress import KnormPress
from kvpress import DecodingPress
# Initialize the pipeline
device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)
# Create a decoding press that compresses every 10 steps to 512 tokens
decoding_press = DecodingPress(
    base_press=KnormPress(),
    compression_interval=10,  # compress every 10 decoding steps
    target_size=512,  # cache size after each compression
)
# Use with pipeline
context = "A very long text you want to compress during generation"
question = "Tell me a long story about this context"
response = pipe(context, question=question, press=decoding_press)["answer"]

> Not all existing presses are fully compatible with DecodingPress due to fundamental differences in how compression works during decoding versus prefilling. In particular, only ScorerPress instances are supported as base presses.
Available presses
All current presses are training-free and inherit from BasePress ([source](kvpress/presses/base_press.py)).
Several presses inherit from ScorerPress ([source](kvpress/presses/scorer_press.py)) and rely on a score to prune the KV pairs with lowest importance:
- RandomPress ([source](kvpress/presses/random_press.py)): random score
- KnormPress ([source](kvpress/presses/knorm_press.py), paper): inverse norm of the key
- SnapKVPress ([source](kvpress/presses/snapkv_press.py), paper): average attention weight of the last queries
- ExpectedAttentionPress ([source](kvpress/presses/expected_attention_press.py), [notebook](notebooks/expected_attention.ipynb)): expected attention weight during the generation phase
- StreamingLLMPress ([source](kvpress/presses/streaming_llm_press.py), paper): keep only the initial and recent tokens
- TOVAPress ([source](kvpress/presses/tova_press.py), paper): attention weight of the last query averaged across heads
- ObservedAttentionPress ([source](kvpress/presses/observed_attention_press.py), paper): average attention weight observed during the prefilling phase
- QFilterPress ([source](kvpress/presses/qfilter_press.py), paper): project the key representations onto the main SVD component of the query vectors to approximate the attention scores
- PyramidKVPress ([source](kvpress/presses/pyramidkv_press.py), paper): maintain pyramid-like cache sizes, allocating more cache budget to lower layers and less to higher layers
- LagKVPress ([source](kvpress/presses/lagkv_press.py),…
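The score-then-prune pattern shared by these presses can be sketched in a few lines of plain Python. This is an illustration only, not kvpress code: `knorm_scores` and `prune_by_score` are hypothetical helpers, keys are plain lists of floats for a single head, and the reading of compression_ratio as the fraction of KV pairs removed is an assumption. The scoring follows the KnormPress description, using the negative norm, which ranks keys the same way as the inverse norm.

```python
import math


def knorm_scores(keys):
    """Score each key by its negative L2 norm, so small-norm keys rank highest."""
    return [-math.sqrt(sum(x * x for x in key)) for key in keys]


def prune_by_score(keys, values, scores, compression_ratio):
    """Keep the highest-scoring KV pairs, preserving their original order."""
    n_kept = int(len(keys) * (1 - compression_ratio))
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:n_kept]
    kept = sorted(top)  # restore positional order of the surviving pairs
    return [keys[i] for i in kept], [values[i] for i in kept]


keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [6.0, 8.0]]
values = ["v0", "v1", "v2", "v3"]
kept_keys, kept_values = prune_by_score(keys, values, knorm_scores(keys), compression_ratio=0.5)
print(kept_values)
```

Swapping `knorm_scores` for another scoring function (random values, attention-based statistics) changes which pairs survive, while the pruning step stays the same. That separation is what makes a shared scorer base class a natural design.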
Excerpt shown — open the source for the full document.