meituan-longcat/R-HORIZON
Description: [ICLR'26] R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
Language: Python
License: MIT
Stars: 26
Forks: 3
Open issues: 0
Created: 2025-10-21T09:40:58Z
Pushed: 2026-05-09T10:23:09Z
Default branch: main
Fork: no
Archived: no
README:
📃 Paper • 🌐 Project Page • 🤗 Dataset • 🤗 Models
R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span long reasoning horizons.

🔥 Releases
[2026-03]
- 🤗 Models are available on Hugging Face: R-HORIZON Models
[2026-01]
- 🎉 R-HORIZON is accepted to ICLR 2026!
[2025-10]
- 🎉 R-HORIZON Benchmark is now available! Test your LRMs on complex multi-horizon reasoning tasks.
- 🤗 Training and evaluation datasets are available on Hugging Face: R-HORIZON Dataset
- 📄 Paper released on arXiv: R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
🌟 Overview
Recent advances in reasoning-focused language models (e.g., OpenAI o1, DeepSeek-R1) have demonstrated remarkable improvements through test-time scaling and long Chain-of-Thought (CoT). However, existing benchmarks primarily focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to handle complex, long-horizon scenarios.
Key challenges in current paradigms:
- Limited evaluation scope: Existing benchmarks confine themselves to isolated problems, missing the complexity of real-world multi-step reasoning
- Limited effective reasoning length: Models struggle to maintain performance as reasoning chains grow longer
- Poor thinking budget allocation: LRMs fail to appropriately distribute thinking resources across multiple interdependent problems
To address these limitations, we introduce R-HORIZON, which:
- Transforms isolated problems into complex multi-step reasoning scenarios through query composition
- Establishes the R-HORIZON Benchmark comprising 6 representative datasets from mathematics, code generation, and agent applications
- Enables reinforcement learning with verified rewards (RLVR) using long-horizon reasoning data
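As a rough illustration of query composition, the sketch below chains isolated problems so that each later problem depends on the previous answer. The `Problem` class, the `{prev}` placeholder, and `compose_queries` are hypothetical names chosen for this example, not the repository's API; the actual construction pipeline may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Problem:
    # Template text; "{prev}" marks where the previous answer is referenced.
    template: str


def compose_queries(problems: List[Problem]) -> str:
    """Join problems into one prompt in which problem i refers to the answer of problem i-1."""
    parts = []
    for i, problem in enumerate(problems, start=1):
        if i == 1:
            text = problem.template
        else:
            text = problem.template.replace("{prev}", f"[answer to Problem {i - 1}]")
        parts.append(f"Problem {i}: {text}")
    parts.append("Solve the problems in order and report every answer.")
    return "\n".join(parts)


chain = [
    Problem("Compute 3 + 4."),
    Problem("Let x be {prev}. Compute x * 2."),
    Problem("Let y be {prev}. Compute y - 5."),
]
prompt = compose_queries(chain)
print(prompt)
```

Because each problem consumes the previous answer, an early mistake propagates through the chain, which is what makes the composed query a long-horizon task rather than a batch of independent ones.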

📖 Table of Contents
- [🔥 Releases](#-releases)
- [🌟 Overview](#-overview)
- [📊 R-HORIZON Benchmark](#-r-horizon-benchmark)
- [🚀 Training with R-HORIZON](#-training-with-r-horizon)
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Benchmark Evaluation](#benchmark-evaluation)
  - [Training with R-HORIZON datasets](#training-with-r-horizon-datasets)
- [Dataset](#dataset)
  - [Dataset Construction](#dataset-construction)
  - [Dataset on Hugging Face Hub](#dataset-on-hugging-face-hub)
  - [Dataset Structure](#dataset-structure)
- [Citation](#citation)
📊 R-HORIZON Benchmark
We evaluate 20+ state-of-the-art LRMs on the R-HORIZON Benchmark, revealing significant performance degradation as reasoning horizons increase:

Key findings from our benchmark evaluation:
- Universal performance degradation: Even the most powerful models suffer severe drops as problem count increases. For instance, DeepSeek-R1 drops from 87.3% (single problem) to 24.6% (5 problems) on AIME25.
- Model size matters: Larger models exhibit more resilience to multi-horizon challenges. R1-Qwen-7B drops from 93.6% to 0% when solving 16 problems, showing 34.1% more degradation than the 32B models.
- Task-dependent degradation: Code generation tasks show steeper performance declines compared to mathematics. Many reasoning models lose their tool-calling abilities in web search scenarios, resulting in poor multi-step performance.
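One way to read these numbers is against a naive baseline. Assume, purely for illustration, that each problem in a composed query were solved independently with the single-problem accuracy p; the chance of solving all n would then be p^n. The function name below is hypothetical, and the independence assumption is ours, not a claim made by the benchmark.

```python
def independent_all_correct(p_single: float, n: int) -> float:
    """Probability of solving all n problems if each were independent with accuracy p_single."""
    return p_single ** n


# DeepSeek-R1 on AIME25, figures quoted above: 87.3% single, 24.6% with 5 problems.
expected = independent_all_correct(0.873, 5)
observed = 0.246
print(f"independent baseline: {expected:.3f}, observed: {observed:.3f}")
```

Under that assumption the baseline is about 0.507, roughly double the observed 0.246, so the drop is larger than accumulated per-problem error alone would suggest.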
🚀 Training with R-HORIZON
Training with R-HORIZON composed data yields substantial improvements on both single and multi-horizon reasoning tasks:

Training results highlights:
- Dual Performance Gains: Training with 2-composed problems significantly improves both multi-horizon reasoning (+17.4 points on AIME24 n=2) and single-problem performance (+7.5 points on AIME24 original).
- Scalable Complexity: Increasing composition complexity (n=4) enhances the model's ability to handle problems requiring more reasoning steps, achieving 50.6% on MATH500 (n=8).
| Models          | MATH500 (Origin) | MATH500 (n=8) | AIME24 (Origin) | AIME24 (n=2) | AIME25 (Origin) | AIME25 (n=2) | AMC23 (Origin) | AMC23 (n=2) |
|-----------------|------------------|---------------|-----------------|--------------|-----------------|--------------|----------------|-------------|
| R1-Qwen-7B      | 93.6             | 11.8          | 48.3            | 16.4         | 33.3            | 3.5          | 90.2           | 48.8        |
| Baseline (n=1)  | 95.6             | 8.4           | 57.9            | 16.7         | 47.9            | 5.1          | 95.9           | 55.0        |
| R-HORIZON (n=2) | 95.4             | 21.4          | 65.4            | 34.1         | 49.6            | 10.0         | 94.1           | 80.6        |
| R-HORIZON (n=4) | 94.6             | 50.6          | 62.9            | 34.8         | 45.4            | 8.1          | 91.9           | 79.1        |
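The gains quoted in the highlights can be recomputed from the table by comparing the R-HORIZON (n=2) row with the Baseline (n=1) row. The snippet below hard-codes the two AIME24 columns for those rows; the dictionary layout is only an illustrative choice.

```python
# AIME24 scores copied from the table above: (Origin, n=2).
aime24 = {
    "Baseline (n=1)": (57.9, 16.7),
    "R-HORIZON (n=2)": (65.4, 34.1),
}

base_origin, base_n2 = aime24["Baseline (n=1)"]
rh_origin, rh_n2 = aime24["R-HORIZON (n=2)"]

# Round to one decimal to match the table's precision.
gain_origin = round(rh_origin - base_origin, 1)
gain_n2 = round(rh_n2 - base_n2, 1)
print(f"AIME24 origin: +{gain_origin}, AIME24 n=2: +{gain_n2}")
```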
Quick Start
Installation
```bash
# Clone the repository
git clone https://github.com/meituan-longcat/R-HORIZON.git
cd R-HORIZON

# Create conda environment
conda create -n r-horizon python=3.10 -y
conda activate r-horizon

# Install PyTorch
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

# Install additional dependencies
pip install -r requirements.txt
```
Benchmark Evaluation
1. Download the R-HORIZON Benchmark
```bash
# Download benchmark datasets
python ./evaluation/data/download.py
```
2. Modify config.json under the evaluation directory
{
"inference": {
// model_key (e.g. r1-distill-qwen7b) is for run.sh
"r1-distill-qwen7b": {
// the ip and port used in vllm server
"base_url": "http://{Your IP and Port}/v1/completions",
"api_key": "EMPTY",
// model_name is corresponding to the modelname in vllm server
"model_name": "{vllm's modelname}",
"params": {
"temperature": 1.0,
"top_p": 0.95,
"top_k": 10,
"max_tokens": 65536
},
"prompt_prefix": "user:\n",
"prompt_suffix":…Excerpt shown — open the source for the full document.
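For orientation, here is a minimal sketch of how one entry of this config could be turned into a request body for a vLLM OpenAI-compatible completions endpoint. It is an assumption about usage, not the repository's evaluation code: `build_payload` is a hypothetical helper, the address and model name are made-up placeholders, and only the fields visible in the excerpt are used. The `//` comments in the excerpt are not valid JSON, so the entry is written as a Python dict here.

```python
import json

# Hypothetical entry mirroring the fields visible in the excerpt above.
entry = {
    "base_url": "http://localhost:8000/v1/completions",  # placeholder address
    "api_key": "EMPTY",
    "model_name": "my-served-model",  # placeholder; must match the name served by vLLM
    "params": {"temperature": 1.0, "top_p": 0.95, "top_k": 10, "max_tokens": 65536},
    "prompt_prefix": "user:\n",
}


def build_payload(entry: dict, question: str) -> dict:
    """Assemble a completions request body from one config entry (illustrative only)."""
    payload = {"model": entry["model_name"], "prompt": entry["prompt_prefix"] + question}
    payload.update(entry["params"])  # sampling parameters are passed through unchanged
    return payload


payload = build_payload(entry, "What is 2 + 2?")
headers = {"Authorization": f"Bearer {entry['api_key']}", "Content-Type": "application/json"}
print(json.dumps(payload, indent=2))
```

Sending `payload` as JSON in a POST request to `entry["base_url"]` with `headers` would then query the server; that step is left out because it needs a running vLLM instance.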