What does this repo signal mean?

NVIDIA published NVIDIA/kvpress (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo NVIDIA/kvpress · language Python. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

NVIDIA Repo: NVIDIA/kvpress

Captured source

GitHub/github.com/NVIDIA/kvpress

NVIDIA/kvpress repository metadata

published Nov 6, 2024 · seen 1d · captured 9h · http 200 · method plain

NVIDIA/kvpress

Description: LLM KV cache compression made easy

Language: Python

License: Apache-2.0

Stars: 1108

Forks: 150

Open issues: 4

Created: 2024-11-06T19:23:20Z

Pushed: 2026-06-10T10:03:06Z

Default branch: main

Fork: no

Archived: no

README: ![PyPI version](https://badge.fury.io/py/kvpress) ![Colab example notebook](https://colab.research.google.com/drive/1JNvaTKuuAHrl49dYB9-mdEH_y52Ib-NP?usp=drive_link)

![kvpress](kvpress.jpg)

Deploying long-context LLMs is costly due to the linear growth of the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory. kvpress implements multiple KV cache compression methods and benchmarks using 🤗 transformers, aiming to simplify the development of new methods for researchers and developers in this field.
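The 330GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes the published Llama 3.1-70B shape (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and 2 bytes per float16 value. These numbers are not stated in the README, and the function name is illustrative.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV cache for one sequence, in bytes."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed Llama 3.1-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128, float16.
size = kv_cache_bytes(num_tokens=1_000_000, num_layers=80, num_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # 327.7 GB
```

Under these assumptions each token costs 327,680 bytes, so the cache scales linearly with context length and lands at roughly 330GB for 1M tokens.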

Installation

pip install kvpress

For a local installation, use uv:

git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync

To install with all optional dependencies, run:

git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync --extra eval --extra flash-attn

Usage

KVPress provides a set of "presses" that compress the KV cache during the prefilling phase. Each press has a compression_ratio attribute that measures how much the cache is compressed. The easiest way to use a press is through our custom KVPressTextGenerationPipeline. It is automatically registered as a transformers pipeline under the name "kv-press-text-generation" when kvpress is imported, and it handles chat templates and tokenization for you:

from transformers import pipeline
from kvpress import ExpectedAttentionPress

model = "Qwen/Qwen3-8B"
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto", dtype="auto")

context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context" # optional

press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]

In the snippet above, compression is applied only to the context tokens, so you can evaluate the same compressed context against different questions. Check the [Wikipedia notebook demo](notebooks/wikipedia_demo.ipynb) for a more detailed example (also available on Colab).

Decoding Compression

By default, KVPress applies compression during the prefilling phase. As a new (experimental) feature, we now support decoding compression via the DecodingPress wrapper. DecodingPress compresses the KV cache periodically during token generation, optionally maintaining a buffer of recent hidden states. DecodingPress supports the following parameters:

base_press: Any ScorerPress (e.g., KnormPress, CriticalKVPress)
compression_interval: Number of decoding steps between compressions (default: 10)
target_size: Target size of the cache after compression (default: 1024)
hidden_states_buffer_size: Number of hidden states to buffer before compression (default: 128). Some presses don't need buffered hidden states and can set this to 0.

Instead of a fixed compression ratio, DecodingPress takes a target_size. The cache is compressed every compression_interval steps, and the compression ratio is computed automatically so that the size of the cache after compression equals target_size.
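To make the relationship between target_size and the implied ratio concrete, here is a small simulation in plain Python. It does not call kvpress. It assumes the cache grows by one entry per decoding step and that the ratio is the fraction of entries removed, and it ignores hidden_states_buffer_size. Both function names are hypothetical.

```python
def implied_compression_ratio(cache_len, target_size):
    """Fraction of entries to drop so that cache_len shrinks to target_size."""
    if cache_len <= target_size:
        return 0.0  # nothing to prune yet
    return 1.0 - target_size / cache_len


def simulate_decoding(prefill_len, num_steps, compression_interval=10, target_size=1024):
    """Return the cache length after each decoding step (illustrative only)."""
    cache_len = prefill_len
    history = []
    for step in range(1, num_steps + 1):
        cache_len += 1  # each generated token adds one KV entry
        if step % compression_interval == 0 and cache_len > target_size:
            ratio = implied_compression_ratio(cache_len, target_size)
            cache_len = round(cache_len * (1.0 - ratio))  # lands on target_size
        history.append(cache_len)
    return history


sizes = simulate_decoding(prefill_len=1020, num_steps=30)
print(max(sizes), sizes[-1])
```

In this toy model the cache never exceeds target_size + compression_interval - 1 at the end of a step, which is the practical consequence of compressing to a fixed size at a fixed interval.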

An example for decoding compression:

from transformers import pipeline
from kvpress import KnormPress
from kvpress import DecodingPress

# Initialize the pipeline
device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)

# Create a decoding press that compresses every 10 steps to 512 tokens
decoding_press = DecodingPress(
    base_press=KnormPress(),
    compression_interval=10,  # compress every 10 decoding steps
    target_size=512,  # cache size after each compression
)

# Use with pipeline
context = "A very long text you want to compress during generation"
question = "Tell me a long story about this context"
response = pipe(context, question=question, press=decoding_press)["answer"]

> Not all existing presses are fully compatible with DecodingPress due to fundamental differences in how compression works during decoding versus prefilling. In particular, only ScorerPresses are supported as base presses.

Available presses

All current presses are training free and inherit from BasePress ([source](kvpress/presses/base_press.py)).

Several presses inherit from ScorerPress ([source](kvpress/presses/scorer_press.py)) and rely on a score to prune the KV pairs with lowest importance:

RandomPress ([source](kvpress/presses/random_press.py)): random score
KnormPress ([source](kvpress/presses/knorm_press.py), paper): inverse norm of the key
SnapKVPress ([source](kvpress/presses/snapkv_press.py), paper): average attention weight of the last queries
ExpectedAttentionPress ([source](kvpress/presses/expected_attention_press.py), [notebook](notebooks/expected_attention.ipynb)): expected attention weight during the generation phase
StreamingLLMPress ([source](kvpress/presses/streaming_llm_press.py), paper): keep only the initial and recent tokens
TOVAPress ([source](kvpress/presses/tova_press.py), paper): attention weight of the last query averaged across heads
ObservedAttentionPress ([source](kvpress/presses/observed_attention_press.py), paper): average attention weight observed during the prefilling phase
QFilterPress ([source](kvpress/presses/qfilter_press.py), paper): project the Key representations on the main SVD component of the Query vectors to approximate the attention scores.
PyramidKVPress ([source](kvpress/presses/pyramidkv_press.py), paper): maintain pyramid-like cache sizes, allocating more cache budget to lower layers and less to higher layers
LagKVPress ([source](kvpress/presses/lagkv_press.py),…
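
The score-then-prune pattern shared by these presses can be illustrated without any model. The sketch below is plain Python, not kvpress code. It assumes compression_ratio is the fraction of KV pairs removed, scores each key by its negated L2 norm (one reading of "inverse norm of the key" in the KnormPress entry, which ranks keys the same way as 1/norm), and keeps the highest-scoring positions in their original order. All function names are hypothetical.

```python
import math


def knorm_scores(keys):
    """Score each key vector by its negated L2 norm (smaller norm => higher score)."""
    return [-math.sqrt(sum(x * x for x in key)) for key in keys]


def prune_by_score(keys, values, scores, compression_ratio):
    """Keep the top-scoring KV pairs, preserving their original order."""
    n_kept = int(len(keys) * (1 - compression_ratio))
    # Rank positions by score, take the best n_kept, then restore sequence order.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_kept])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [6.0, 8.0]]
values = ["v0", "v1", "v2", "v3"]
_, kept_values, kept_idx = prune_by_score(keys, values, knorm_scores(keys), compression_ratio=0.5)
print(kept_idx, kept_values)  # [1, 2] ['v1', 'v2']
```

Swapping knorm_scores for another scoring function (random values, attention weights, and so on) changes which pairs survive while the pruning step stays the same, which is presumably why several presses can share one base class.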

Excerpt shown — open the source for the full document.