Repo · Amazon (Nova) · published Feb 11, 2026

amazon-science/icr-kv-caching-long-context-llms

Python

amazon-science/icr-kv-caching-long-context-llms

Description: Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models

Language: Python

License: NOASSERTION

Stars: 6

Forks: 0

Open issues: 1

Created: 2026-02-11T10:19:41Z

Pushed: 2026-05-06T20:02:48Z

Default branch: main

Fork: no

Archived: no

README:

Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models

Official repository for the paper "Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models".

Table of Contents

  • [Repository Structure](#repository-structure)
  • [Installation](#installation)
  • [Quick Start](#quick-start)
  • [Inference](#inference)
  • [Evaluation](#evaluation)
  • [Training](#training)
  • [Preprocessing](#preprocessing)
  • [Analysis](#analysis)
  • [Advanced Topics](#advanced-topics)

---

Repository Structure

icr-kv-caching-long-context-llms/
├── config/                      # YAML configuration files
│   ├── inference/               # Inference configs per benchmark/model
│   ├── evaluation/              # Evaluation configs
│   ├── training/                # Training configs (Kubernetes YAML)
│   ├── analysis/                # Attention analysis configs
│   └── server_inference/        # vLLM server configs
├── data/                        # Data storage (outputs, indexes, benchmarks)
│   ├── outputs/                 # Model inference outputs
│   ├── evaluation/              # Evaluation results
│   ├── indexes/                 # FAISS indexes for RAG
│   ├── train/                   # Training data
│   ├── benchmarks/              # Downloaded benchmarks
│   └── wikipedia/               # Wikipedia passages
├── plots/                       # Plotting scripts and images
├── prompts/                     # Prompt templates (JSON format)
│   ├── inference/               # Inference prompts
│   ├── evaluation/              # Evaluation prompts
│   └── analysis/                # Analysis prompts
├── scripts/                     # Bash scripts for running experiments
│   ├── inference/               # Inference scripts per benchmark/model
│   ├── evaluation/              # Evaluation scripts per benchmark/model
│   └── training/                # Training launch scripts
├── src/amzn_long_context_rag/
│   ├── inference/               # Inference logic (local & vLLM)
│   ├── evaluation/              # Evaluation metrics
│   ├── training/                # Reward functions for RL training
│   ├── data/                    # Data loaders
│   ├── retriever/               # RAG retriever implementation
│   ├── preprocessing/           # Data preprocessing utilities
│   └── analysis/                # Attention analysis tools
└── notebooks/                   # Jupyter notebooks for exploration

---

Installation

Prerequisites

  • Python 3.10
  • CUDA-compatible GPU (for inference and training)
  • Conda or virtualenv

Setup

# Create conda environment
conda create -n longcontext python=3.10
conda activate longcontext

# Install the package
pip install -e .

# Install flash-attention (required for efficient inference)
pip install flash-attn==2.7.4.post1 --no-build-isolation

Verify Installation

python -c "import src.amzn_long_context_rag; print('Installation successful!')"
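
Beyond the one-line import check, it can help to confirm that the interpreter version and the heavier dependencies are visible before launching a long job. The following is a minimal sketch, not part of the repository: the function name `check_environment` and the default package list (`torch`, `flash_attn`, `vllm`) are assumptions based on the tools this README mentions.

```python
import importlib.util
import sys


def check_environment(required=("torch", "flash_attn", "vllm")):
    """Report whether the interpreter and key packages look as expected.

    The default package list is an assumption, not taken from the repository.
    """
    # The README lists Python 3.10 as a prerequisite.
    report = {"python_3_10": sys.version_info[:2] == (3, 10)}
    for name in required:
        # find_spec locates a top-level module without importing it,
        # so heavy packages are not loaded just to be checked.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'MISSING'}")
```

A missing entry only means the module cannot be found on the current path; it does not verify that a GPU is usable.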

---

Quick Start

Run Inference on LongBench-v2

# Navigate to the repository root
cd icr-kv-caching-long-context-llms

# Run zero-shot inference
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh

# Run with RAG
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot_rag.sh

Evaluate Results

# Evaluate the inference outputs
bash scripts/evaluation/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
Qwen2.5-7B-Instruct-1M zero_shot

---

Inference

Overview

The main inference script is src/amzn_long_context_rag/inference/async_inference.py. It supports:

  • Full context mode: Pass entire context to the model
  • Top-k RAG mode: Retrieve top-k relevant chunks
  • No context mode: Zero-shot inference without context
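
The three modes can be pictured as a small dispatch over the retrieved chunks. The sketch below is illustrative only and does not reproduce the code in async_inference.py: the function `select_context`, its arguments, and the choice to restore document order after ranking are all assumptions. Only the mode names `full`, `topk`, and `none` come from the configuration shown in this README.

```python
from typing import List, Sequence


def select_context(
    mode: str,
    chunks: Sequence[str],
    scores: Sequence[float],
    top_k: int = 5,
) -> List[str]:
    """Return the chunks that would be placed in the prompt for a given mode."""
    if mode == "full":
        # Full context: every chunk is passed to the model.
        return list(chunks)
    if mode == "none":
        # No context: the model answers from the question alone.
        return []
    if mode == "topk":
        if len(chunks) != len(scores):
            raise ValueError("chunks and scores must have the same length")
        # Rank chunk indices by descending relevance and keep the best top_k.
        ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:top_k]
        # Assumption: restore original document order so the prompt reads naturally.
        return [chunks[i] for i in sorted(ranked)]
    raise ValueError(f"unknown retrieval mode: {mode!r}")
```

For example, with four chunks scored `[0.1, 0.9, 0.3, 0.8]` and `top_k=2`, the second and fourth chunks are kept, in that order.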

Running Inference

Method 1: Using Bash Scripts (Recommended)

# General pattern
bash scripts/inference/<benchmark>/<model>/<method>.sh [OVERRIDES]

# Examples
bash scripts/inference/InfiniteBench/Qwen2.5-7B-Instruct-1M/zero_shot.sh
bash scripts/inference/Loong/glm-4-9b-chat-1m/zero_shot_rag.sh

Method 2: Direct Python Execution

python src/amzn_long_context_rag/inference/async_inference.py \
--config_path config/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.yaml \
[KEY=VALUE ...]
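
The example commands suggest that scripts and configs share a benchmark/model/method directory layout. Under that assumption, the matching paths can be derived from three names. The helper `experiment_paths` below is hypothetical and not part of the repository.

```python
from pathlib import Path


def experiment_paths(benchmark: str, model: str, method: str, root: str = ".") -> dict:
    """Build the script and config paths for one experiment.

    Assumes the scripts/ and config/ trees mirror each other, as the
    examples in this README suggest.
    """
    base = Path(root)
    return {
        "script": base / "scripts" / "inference" / benchmark / model / f"{method}.sh",
        "config": base / "config" / "inference" / benchmark / model / f"{method}.yaml",
    }


if __name__ == "__main__":
    paths = experiment_paths("LongBench-v2", "Qwen2.5-7B-Instruct-1M", "zero_shot")
    for kind, path in paths.items():
        # Print whether each expected file is present in the current checkout.
        print(f"{kind}: {path} (exists: {path.exists()})")
```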

Inference Parameters

Command-Line Arguments

  • --config_path or -c: Path to YAML configuration file (required)
  • Additional arguments can override config values using dotlist notation: key.subkey=value
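
To make the dotlist notation concrete, here is a minimal standard-library sketch of how `key.subkey=value` strings can be merged into a nested configuration. The repository may well rely on a configuration library for this (the term "dotlist" is used by OmegaConf, for instance), so treat `parse_value` and `apply_dotlist` as hypothetical helpers that only illustrate the idea.

```python
import ast
from typing import Any, Dict, Iterable


def parse_value(text: str) -> Any:
    """Interpret an override value as a typed literal when possible."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    if lowered in ("null", "none"):
        return None
    try:
        # Handles integers, floats, and quoted strings.
        return ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # Anything else, such as a model name, stays a plain string.
        return text


def apply_dotlist(config: Dict[str, Any], overrides: Iterable[str]) -> Dict[str, Any]:
    """Apply key.subkey=value overrides to a nested dict in place and return it."""
    for item in overrides:
        dotted_key, sep, raw_value = item.partition("=")
        if not sep or not dotted_key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(key, {})
        node[leaf] = parse_value(raw_value)
    return config
```

Applied to a config containing `retrieval: {mode: full, top_k: 5}`, the overrides `retrieval.mode=topk retrieval.top_k=10` change only those two keys and leave the rest untouched.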

Configuration File Structure

seed: 42 # Random seed
output_dir: data/outputs # Output directory
model_name: Qwen/Qwen2.5-7B-Instruct-1M # HuggingFace model name
device: cuda # Device (cuda/cpu)

vllm_params:
  gpu_memory_utilization: 0.8 # GPU memory fraction
  max_model_len: 1010000 # Maximum sequence length
  max_num_batched_tokens: 131072 # Batch size in tokens
  enforce_eager: true # Disable CUDA graphs
  tensor_parallel_size: 2 # Tensor parallelism
  pipeline_parallel_size: 1 # Pipeline parallelism
  enable_lora: false # Enable LoRA adapters
  lora_adapter_path: null # Path to LoRA weights

sampling_params:
  temperature: 0.0 # Sampling temperature
  n: 1 # Number of completions
  max_tokens: 2048 # Max generation length

dataset:
  data_loader: LongBench-v2 # Dataset loader class
  path: THUDM/LongBench-v2 # HuggingFace dataset path
  split: train # Dataset split
  name: null # Dataset subset name
  prompt_obj: prompts/inference/LongBench-v2/zero_shot.json
  continue_final_message: false # Continue from last message

retrieval:
  mode: full # full/topk/none
  top_k: 5 # Number of chunks (for topk)
  embedding_model_name: Qwen/Qwen3-Embedding-4B
  index_dir: data/indexes # FAISS index directory
  use_offline_hits: false # Use pre-computed retrieval
  offline_hits_path: null # Path to offline hits

context_max_tokens: null # Max context tokens (null=auto)
split_docs: false # Split context into [DOC i] format
limit_samples: null # Limit number of samples (for testing)
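
Several of these fields only make sense in combination, so a quick sanity check before a long run can save GPU time. The sketch below operates on the configuration after it has been loaded into a nested dict. The function `validate_config` is hypothetical, and the cross-field rules it enforces are assumptions inferred from the inline comments above, not checks taken from the repository.

```python
from typing import Any, Dict, List

# Mode names as listed in the retrieval.mode comment.
VALID_MODES = {"full", "topk", "none"}


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []

    retrieval = cfg.get("retrieval") or {}
    mode = retrieval.get("mode")
    if mode not in VALID_MODES:
        problems.append(f"retrieval.mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    # Assumption: top_k is only consulted in topk mode.
    top_k = retrieval.get("top_k")
    if mode == "topk" and not (isinstance(top_k, int) and top_k > 0):
        problems.append("retrieval.top_k must be a positive integer when mode is topk")

    # Assumption: pre-computed hits need a path to read from.
    if retrieval.get("use_offline_hits") and not retrieval.get("offline_hits_path"):
        problems.append("retrieval.offline_hits_path is required when use_offline_hits is true")

    vllm = cfg.get("vllm_params") or {}
    utilization = vllm.get("gpu_memory_utilization", 0.8)
    if not 0 < utilization <= 1:
        problems.append("vllm_params.gpu_memory_utilization must be in (0, 1]")

    # Assumption: a LoRA adapter path is needed whenever LoRA is enabled.
    if vllm.get("enable_lora") and not vllm.get("lora_adapter_path"):
        problems.append("vllm_params.lora_adapter_path is required when enable_lora is true")

    return problems
```

Returning a list rather than raising on the first error lets a launcher report every problem at once.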

Runtime Overrides

Override any config parameter at runtime:

# Change max generation length
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
sampling_params.max_tokens=512

# Change model
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
model_name=Qwen/Qwen2.5-14B-Instruct-1M

# Enable RAG with top-k=10
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
retrieval.mode=topk retrieval.top_k=10

# Limit to 100 samples for testing
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
limit_samples=100

Excerpt shown — open the source for the full document.
