amazon-science/icr-kv-caching-long-context-llms
Description: Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models
Language: Python
License: NOASSERTION
Stars: 6
Forks: 0
Open issues: 1
Created: 2026-02-11T10:19:41Z
Pushed: 2026-05-06T20:02:48Z
Default branch: main
Fork: no
Archived: no
README:
Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models
Official repository for the paper "Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models".
Table of Contents
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Inference](#inference)
- [Evaluation](#evaluation)
- [Training](#training)
- [Preprocessing](#preprocessing)
- [Analysis](#analysis)
- [Advanced Topics](#advanced-topics)
---
Repository Structure
icr-kv-caching-long-context-llms/
├── config/                      # YAML configuration files
│   ├── inference/               # Inference configs per benchmark/model
│   ├── evaluation/              # Evaluation configs
│   ├── training/                # Training configs (Kubernetes YAML)
│   ├── analysis/                # Attention analysis configs
│   └── server_inference/        # vLLM server configs
├── data/                        # Data storage (outputs, indexes, benchmarks)
│   ├── outputs/                 # Model inference outputs
│   ├── evaluation/              # Evaluation results
│   ├── indexes/                 # FAISS indexes for RAG
│   ├── train/                   # Training data
│   ├── benchmarks/              # Downloaded benchmarks
│   └── wikipedia/               # Wikipedia passages
├── plots/                       # Plotting scripts and images
├── prompts/                     # Prompt templates (JSON format)
│   ├── inference/               # Inference prompts
│   ├── evaluation/              # Evaluation prompts
│   └── analysis/                # Analysis prompts
├── scripts/                     # Bash scripts for running experiments
│   ├── inference/               # Inference scripts per benchmark/model
│   ├── evaluation/              # Evaluation scripts per benchmark/model
│   └── training/                # Training launch scripts
├── src/amzn_long_context_rag/
│   ├── inference/               # Inference logic (local & vLLM)
│   ├── evaluation/              # Evaluation metrics
│   ├── training/                # Reward functions for RL training
│   ├── data/                    # Data loaders
│   ├── retriever/               # RAG retriever implementation
│   ├── preprocessing/           # Data preprocessing utilities
│   └── analysis/                # Attention analysis tools
└── notebooks/                   # Jupyter notebooks for exploration
---
Installation
Prerequisites
- Python 3.10
- CUDA-compatible GPU (for inference and training)
- Conda or virtualenv
Setup
# Create conda environment
conda create -n longcontext python==3.10
conda activate longcontext

# Install the package
pip install -e .

# Install flash-attention (required for efficient inference)
pip install flash-attn==2.7.4.post1 --no-build-isolation
Verify Installation
python -c "import src.amzn_long_context_rag; print('Installation successful!')"
---
Quick Start
Run Inference on LongBench-v2
# Navigate to the repository root
cd icr-kv-caching-long-context-llms

# Run zero-shot inference
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh

# Run with RAG
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot_rag.sh
Evaluate Results
# Evaluate the inference outputs
bash scripts/evaluation/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
  Qwen2.5-7B-Instruct-1M zero_shot
---
Inference
Overview
The main inference script is src/amzn_long_context_rag/inference/async_inference.py. It supports:
- Full context mode: Pass entire context to the model
- Top-k RAG mode: Retrieve top-k relevant chunks
- No context mode: Zero-shot inference without context
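The three modes differ only in which context reaches the prompt. The following is a minimal sketch of that selection step, assuming the context has already been split into chunks and each chunk has a relevance score for the query. `select_context` and its arguments are hypothetical names, not the repository's API; the mode strings mirror the `full`/`topk`/`none` values used in the configuration. Returning the selected chunks in document order is also an assumption; the actual implementation may order them by score.

```python
from typing import List, Sequence


def select_context(
    mode: str,
    chunks: Sequence[str],
    scores: Sequence[float],
    top_k: int = 5,
) -> List[str]:
    """Return the chunks that would be placed in the prompt (hypothetical helper)."""
    if mode == "none":
        # No context mode: the model answers from the question alone.
        return []
    if mode == "full":
        # Full context mode: every chunk is passed through unchanged.
        return list(chunks)
    if mode == "topk":
        if len(chunks) != len(scores):
            raise ValueError("chunks and scores must have the same length")
        # Rank chunk indices by descending score and keep the best top_k,
        # then restore document order (an assumption of this sketch).
        ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [chunks[i] for i in sorted(ranked)]
    raise ValueError(f"unknown retrieval mode: {mode!r}")


chunks = ["intro", "methods", "results", "appendix"]
scores = [0.1, 0.9, 0.7, 0.2]
print(select_context("topk", chunks, scores, top_k=2))  # ['methods', 'results']
```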
Running Inference
Method 1: Using Bash Scripts (Recommended)
# General pattern
bash scripts/inference///.sh [OVERRIDES]

# Examples
bash scripts/inference/InfiniteBench/Qwen2.5-7B-Instruct-1M/zero_shot.sh
bash scripts/inference/Loong/glm-4-9b-chat-1m/zero_shot_rag.sh
Method 2: Direct Python Execution
python src/amzn_long_context_rag/inference/async_inference.py \
  --config_path config/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.yaml \
  [KEY=VALUE ...]
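When launching many runs from Python rather than from the shell, the same command can be assembled programmatically. The sketch below assumes that configs follow the benchmark/model/method layout seen in the example paths of this README; that layout is inferred from those examples, not documented, and `build_inference_command` is a hypothetical helper.

```python
import shlex
from pathlib import Path
from typing import List, Sequence


def build_inference_command(
    benchmark: str,
    model: str,
    method: str,
    overrides: Sequence[str] = (),
) -> List[str]:
    """Assemble the argv for a direct run of async_inference.py (hypothetical helper)."""
    # Assumed layout: config/inference/<benchmark>/<model>/<method>.yaml
    config_path = Path("config/inference") / benchmark / model / f"{method}.yaml"
    return [
        "python",
        "src/amzn_long_context_rag/inference/async_inference.py",
        "--config_path",
        config_path.as_posix(),
        *overrides,  # KEY=VALUE overrides are appended as-is
    ]


cmd = build_inference_command(
    "LongBench-v2", "Qwen2.5-7B-Instruct-1M", "zero_shot", ["limit_samples=100"]
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues.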
Inference Parameters
Command-Line Arguments
- --config_path or -c: Path to YAML configuration file (required)
- Additional arguments can override config values using dotlist notation: key.subkey=value
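To make the dotlist notation concrete, here is a minimal standard-library sketch of how `key.subkey=value` strings could be merged into a nested configuration. The repository's actual parser is not shown in this excerpt, so `parse_value` and `apply_dotlist` are hypothetical helpers, and the value-parsing rule (JSON literals first, plain string otherwise) is an assumption.

```python
import copy
import json
from typing import Any, Dict, Iterable


def parse_value(raw: str) -> Any:
    """Parse numbers, booleans and null as JSON; fall back to the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_dotlist(config: Dict[str, Any], overrides: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of config with each 'a.b.c=value' override applied."""
    merged = copy.deepcopy(config)
    for item in overrides:
        dotted_key, sep, raw = item.partition("=")
        if not sep or not dotted_key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                # Create (or replace a scalar with) an intermediate mapping.
                child = {}
                node[part] = child
            node = child
        node[leaf] = parse_value(raw)
    return merged


base = {
    "sampling_params": {"temperature": 0.0, "max_tokens": 2048},
    "retrieval": {"mode": "full", "top_k": 5},
    "limit_samples": None,
}
updated = apply_dotlist(
    base,
    ["sampling_params.max_tokens=512", "retrieval.mode=topk", "limit_samples=100"],
)
print(updated["sampling_params"]["max_tokens"], updated["retrieval"]["mode"])
```

The input dictionary is deep-copied, so the base configuration stays untouched and can be reused across runs.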
Configuration File Structure
seed: 42                              # Random seed
output_dir: data/outputs              # Output directory
model_name: Qwen/Qwen2.5-7B-Instruct-1M  # HuggingFace model name
device: cuda                          # Device (cuda/cpu)

vllm_params:
  gpu_memory_utilization: 0.8         # GPU memory fraction
  max_model_len: 1010000              # Maximum sequence length
  max_num_batched_tokens: 131072      # Batch size in tokens
  enforce_eager: true                 # Disable CUDA graphs
  tensor_parallel_size: 2             # Tensor parallelism
  pipeline_parallel_size: 1           # Pipeline parallelism
  enable_lora: false                  # Enable LoRA adapters
  lora_adapter_path: null             # Path to LoRA weights

sampling_params:
  temperature: 0.0                    # Sampling temperature
  n: 1                                # Number of completions
  max_tokens: 2048                    # Max generation length

dataset:
  data_loader: LongBench-v2           # Dataset loader class
  path: THUDM/LongBench-v2            # HuggingFace dataset path
  split: train                        # Dataset split
  name: null                          # Dataset subset name

prompt_obj: prompts/inference/LongBench-v2/zero_shot.json
continue_final_message: false         # Continue from last message

retrieval:
  mode: full                          # full/topk/none
  top_k: 5                            # Number of chunks (for topk)
  embedding_model_name: Qwen/Qwen3-Embedding-4B
  index_dir: data/indexes             # FAISS index directory
  use_offline_hits: false             # Use pre-computed retrieval
  offline_hits_path: null             # Path to offline hits
  context_max_tokens: null            # Max context tokens (null=auto)
  split_docs: false                   # Split context into [DOC i] format

limit_samples: null                   # Limit number of samples (for testing)
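The `vllm_params` and `sampling_params` keys match argument names of vLLM's `LLM` and `SamplingParams` classes, so one plausible way to consume them is to pass them through as keyword arguments. The sketch below is an assumption about that mapping, not the repository's code: `split_vllm_config` and `build_engine` are hypothetical helpers, and `lora_adapter_path` is removed from the engine arguments because it is not an `LLM` constructor argument (the helper looks for it under `vllm_params` and at the top level).

```python
from typing import Any, Dict, Optional, Tuple


def split_vllm_config(
    cfg: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any], Optional[str]]:
    """Split a loaded config into engine kwargs, sampling kwargs and a LoRA path."""
    vllm_params = dict(cfg.get("vllm_params") or {})
    # The adapter path is handled separately from engine construction.
    lora_path = vllm_params.pop("lora_adapter_path", None) or cfg.get("lora_adapter_path")
    engine_kwargs = {"model": cfg["model_name"], "seed": cfg.get("seed", 0), **vllm_params}
    sampling_kwargs = dict(cfg.get("sampling_params") or {})
    return engine_kwargs, sampling_kwargs, lora_path


def build_engine(cfg: Dict[str, Any]):
    """Instantiate vLLM objects; needs vLLM installed and a suitable GPU."""
    from vllm import LLM, SamplingParams  # imported lazily so the split can be tested alone

    engine_kwargs, sampling_kwargs, _ = split_vllm_config(cfg)
    return LLM(**engine_kwargs), SamplingParams(**sampling_kwargs)


cfg = {
    "seed": 42,
    "model_name": "Qwen/Qwen2.5-7B-Instruct-1M",
    "vllm_params": {
        "gpu_memory_utilization": 0.8,
        "max_model_len": 1010000,
        "tensor_parallel_size": 2,
        "enable_lora": False,
        "lora_adapter_path": None,
    },
    "sampling_params": {"temperature": 0.0, "n": 1, "max_tokens": 2048},
}
engine_kwargs, sampling_kwargs, lora_path = split_vllm_config(cfg)
print(sorted(engine_kwargs))
```

Keeping the split separate from engine construction makes the mapping easy to inspect without loading a model.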
Runtime Overrides
Override any config parameter at runtime:
# Change max generation length
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
  sampling_params.max_tokens=512

# Change model
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
  model_name=Qwen/Qwen2.5-14B-Instruct-1M

# Enable RAG with top-k=10
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
  retrieval.mode=topk retrieval.top_k=10

# Limit to 100 samples for testing
bash scripts/inference/LongBench-v2/Qwen2.5-7B-Instruct-1M/zero_shot.sh \
  limit_samples=100