Repo: Meituan (LongCat), published Oct 21, 2025

meituan-longcat/R-HORIZON

Python

Description: [ICLR'26] R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Language: Python

License: MIT

Stars: 26

Forks: 3

Open issues: 0

Created: 2025-10-21T09:40:58Z

Pushed: 2026-05-09T10:23:09Z

Default branch: main

Fork: no

Archived: no

README:

📃 Paper • 🌐 Project Page • 🤗 Dataset • 🤗 Models

R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span long reasoning horizons.

![](./assets/mainfig.png)

🔥 Releases

[2026-03]

[2026-01]

  • 🎉 R-HORIZON has been accepted to ICLR 2026!

[2025-10]

🌟 Overview

Recent advances in reasoning-focused language models (e.g., OpenAI o1, DeepSeek-R1) have demonstrated remarkable improvements through test-time scaling and long Chain-of-Thought (CoT). However, existing benchmarks primarily focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to handle complex, long-horizon scenarios.

Key challenges in current paradigms:

  • Limited evaluation scope: Existing benchmarks confine themselves to isolated problems, missing the complexity of real-world multi-step reasoning
  • Limited effective reasoning length: Models struggle to maintain performance as reasoning chains grow longer
  • Poor thinking budget allocation: LRMs fail to appropriately distribute thinking resources across multiple interdependent problems

To address these limitations, we introduce R-HORIZON, which:

  • Transforms isolated problems into complex multi-step reasoning scenarios through query composition
  • Establishes the R-HORIZON Benchmark comprising 6 representative datasets from mathematics, code generation, and agent applications
  • Enables reinforcement learning with verifiable rewards (RLVR) using long-horizon reasoning data

![](./assets/method_fig.png)
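As a rough illustration of query composition, the sketch below chains isolated problems into one prompt in which each later problem refers to the previous answer. The `Problem` class, the `{prev}` placeholder, the prompt wording, and the all-or-nothing scoring rule are assumptions made for this example; the actual composition procedure is described in the paper and may differ.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """One isolated problem (hypothetical container, not the repo's schema)."""
    text: str    # may contain "{prev}" where the previous answer is needed
    answer: str  # reference answer used for verification


def compose(problems: list[Problem]) -> str:
    """Join problems into one prompt; later problems point at earlier answers."""
    parts = []
    for i, p in enumerate(problems, start=1):
        # The first problem is expected to contain no "{prev}" placeholder.
        ref = f"[the final answer to Problem {i - 1}]"
        parts.append(f"Problem {i}: {p.text.replace('{prev}', ref)}")
    parts.append(
        f"Solve the {len(problems)} problems in order and list the final answers."
    )
    return "\n".join(parts)


def all_correct(predicted: list[str], problems: list[Problem]) -> bool:
    """Assumed scoring rule: a composed query counts only if every answer matches."""
    return len(predicted) == len(problems) and all(
        pred.strip() == p.answer for pred, p in zip(predicted, problems)
    )


chain = [
    Problem("What is 3 + 4?", "7"),
    Problem("Let x be {prev}. What is 2 * x?", "14"),
]
prompt = compose(chain)
print(prompt)
```

Because the second problem cannot be solved without the first, an early mistake propagates through the chain, which is the kind of interdependence the overview describes.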

📖 Table of Contents

  • [🔥 Releases](#-releases)
  • [🌟 Overview](#-overview)
  • [📊 R-HORIZON Benchmark](#-r-horizon-benchmark)
  • [🚀 Training with R-HORIZON](#-training-with-r-horizon)
  • [Quick Start](#quick-start)
      • [Installation](#installation)
      • [Benchmark Evaluation](#benchmark-evaluation)
      • [Training with R-HORIZON datasets](#training-with-r-horizon-datasets)
  • [Dataset](#dataset)
      • [Dataset Construction](#dataset-construction)
      • [Dataset on Hugging Face Hub](#dataset-on-hugging-face-hub)
      • [Dataset Structure](#dataset-structure)
  • [Citation](#citation)

📊 R-HORIZON Benchmark

We evaluate 20+ state-of-the-art LRMs on the R-HORIZON Benchmark, revealing significant performance degradation as reasoning horizons increase:

![](./assets/result_fig.png)

Key findings from our benchmark evaluation:

  • Universal performance degradation: Even the most powerful models suffer severe drops as problem count increases. For instance, DeepSeek-R1 drops from 87.3% (single problem) to 24.6% (5 problems) on AIME25.
  • Model size matters: Larger models exhibit more resilience to multi-horizon challenges. R1-Qwen-7B drops from 93.6% to 0% when solving 16 problems, showing 34.1% more degradation than the 32B models.
  • Task-dependent degradation: Code generation tasks show steeper performance declines compared to mathematics. Many reasoning models lose their tool-calling abilities in web search scenarios, resulting in poor multi-step performance.
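To put the DeepSeek-R1 figures above in perspective, the sketch below computes the absolute and relative drop and compares the observed multi-problem accuracy with what independent per-problem success would predict. The independence baseline (p^n, assuming a composed query is scored correct only when all n problems are solved) is an assumption of this example, not a number reported in the README.

```python
def degradation(single_acc: float, multi_acc: float, n: int) -> dict[str, float]:
    """Summarize the drop from single-problem to n-problem accuracy (all values in percent)."""
    p = single_acc / 100.0
    return {
        "absolute_drop": single_acc - multi_acc,
        "relative_drop": 100.0 * (single_acc - multi_acc) / single_acc,
        # Assumed reference point: n independent problems, all must be correct.
        "independent_baseline": 100.0 * p ** n,
    }


# DeepSeek-R1 on AIME25: 87.3% (single problem) vs. 24.6% (5 problems).
stats = degradation(87.3, 24.6, 5)
for key, value in stats.items():
    print(f"{key}: {value:.1f}")
```

With these inputs the drop is 62.7 points (about 71.8% relative), while the independence baseline is about 50.7%. Under the stated assumption, the observed 24.6% is well below what compounding single-problem errors alone would explain.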

🚀 Training with R-HORIZON

Training with R-HORIZON-composed data yields substantial improvements on both single- and multi-horizon reasoning tasks:

![](./assets/skywork_n1_n2_comparison.png)

Training result highlights:

  • Dual Performance Gains: Training with 2-composed problems significantly improves both multi-horizon reasoning (+17.4 points on AIME24 n=2) and single-problem performance (+7.5 points on AIME24 original) over the Baseline (n=1) model.
  • Scalable Complexity: Increasing composition complexity to n=4 improves the model's ability to handle problems requiring more reasoning steps, reaching 50.6% on MATH500 (n=8).

| Models | MATH500 (Origin) | MATH500 (n=8) | AIME24 (Origin) | AIME24 (n=2) | AIME25 (Origin) | AIME25 (n=2) | AMC23 (Origin) | AMC23 (n=2) |
|-----------------|------------------|---------------|-----------------|--------------|-----------------|--------------|----------------|-------------|
| R1-Qwen-7B | 93.6 | 11.8 | 48.3 | 16.4 | 33.3 | 3.5 | 90.2 | 48.8 |
| Baseline (n=1) | 95.6 | 8.4 | 57.9 | 16.7 | 47.9 | 5.1 | 95.9 | 55.0 |
| R-HORIZON (n=2) | 95.4 | 21.4 | 65.4 | 34.1 | 49.6 | 10.0 | 94.1 | 80.6 |
| R-HORIZON (n=4) | 94.6 | 50.6 | 62.9 | 34.8 | 45.4 | 8.1 | 91.9 | 79.1 |
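The gains quoted in the highlights can be checked directly against the table. The snippet below copies three columns from it and reports each R-HORIZON variant's difference from the Baseline (n=1) row. Treating that row as the reference point is an assumption inferred from the numbers, since 34.1 - 16.7 = 17.4 and 65.4 - 57.9 = 7.5.

```python
# Accuracy (%) copied from the results table above.
results = {
    "Baseline (n=1)":  {"MATH500 (n=8)": 8.4,  "AIME24 (Origin)": 57.9, "AIME24 (n=2)": 16.7},
    "R-HORIZON (n=2)": {"MATH500 (n=8)": 21.4, "AIME24 (Origin)": 65.4, "AIME24 (n=2)": 34.1},
    "R-HORIZON (n=4)": {"MATH500 (n=8)": 50.6, "AIME24 (Origin)": 62.9, "AIME24 (n=2)": 34.8},
}


def gains(model: str, reference: str = "Baseline (n=1)") -> dict[str, float]:
    """Return per-benchmark point differences between `model` and `reference`."""
    return {
        bench: round(results[model][bench] - results[reference][bench], 1)
        for bench in results[model]
    }


for name in ("R-HORIZON (n=2)", "R-HORIZON (n=4)"):
    print(name, gains(name))
```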

Quick Start

Installation

# Clone the repository
git clone https://github.com/meituan-longcat/R-HORIZON.git
cd R-HORIZON

# Create conda environment
conda create -n r-horizon python=3.10 -y
conda activate r-horizon

# Install PyTorch
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

# Install additional dependencies
pip install -r requirements.txt
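
After installation, a quick check that the key packages resolve in the active environment can save a failed evaluation run later. The helper below is a hypothetical convenience, not part of the repository; it assumes the import names `torch` and `flash_attn` for the two packages installed above.

```python
import importlib.util


def check_packages(names: list[str]) -> dict[str, bool]:
    """Report whether each top-level module can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(["torch", "flash_attn"]).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```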

Benchmark Evaluation

1. Download the R-HORIZON Benchmark

# Download benchmark datasets
python ./evaluation/data/download.py

2. Modify config.json under the evaluation directory

{
  "inference": {
    // model_key (e.g. r1-distill-qwen7b) is used by run.sh
    "r1-distill-qwen7b": {
      // IP and port of the vLLM server
      "base_url": "http://{Your IP and Port}/v1/completions",
      "api_key": "EMPTY",
      // model_name must match the model name served by vLLM
      "model_name": "{vllm's modelname}",
      "params": {
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 10,
        "max_tokens": 65536
      },
      "prompt_prefix": "user:\n",
      "prompt_suffix":…

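The config shown above contains `//` comments, which Python's `json` module rejects. Whether the repository's own loader tolerates them is not visible in this excerpt, so the sketch below shows one way to strip whole-line comments before parsing and to turn a model entry into a completions request body. The function names, the localhost URL, the sample question, and the sample `prompt_suffix` value are hypothetical; passing `top_k` in the body assumes the vLLM server accepts it as an extra sampling parameter.

```python
import json


def load_commented_json(text: str) -> dict:
    """Parse JSON after dropping lines that consist only of a // comment."""
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


def build_request(entry: dict, question: str) -> dict:
    """Assemble a /v1/completions body from one "inference" entry."""
    prompt = entry["prompt_prefix"] + question + entry["prompt_suffix"]
    return {"model": entry["model_name"], "prompt": prompt, **entry["params"]}


# Hypothetical, shortened config; values are placeholders for illustration.
sample = """
{
  "inference": {
    // model key used to select this entry
    "r1-distill-qwen7b": {
      "base_url": "http://localhost:8000/v1/completions",
      "model_name": "my-model",
      "params": {"temperature": 1.0, "top_p": 0.95, "top_k": 10, "max_tokens": 65536},
      "prompt_prefix": "user:\\n",
      "prompt_suffix": "\\nassistant:\\n"
    }
  }
}
"""

config = load_commented_json(sample)
body = build_request(config["inference"]["r1-distill-qwen7b"], "What is 3 + 4?")
print(json.dumps(body, indent=2))
```

Only lines that begin with `//` are dropped, so the `//` inside `base_url` is left untouched.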