What does this repo signal mean?

Meituan (LongCat) published meituan-longcat/R-HORIZON, a Python repository. A repo signal like this surfaces tooling, evaluation, infrastructure, or model-adjacent work that may later appear in a launch post. High-signal details: repo meituan-longcat/R-HORIZON · language Python · new repo with 26 stars, rated routine. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Meituan (LongCat) Repo: meituan-longcat/R-HORIZON

Captured source

source ↗

GitHub · github.com/meituan-longcat/R-HORIZON

meituan-longcat/R-HORIZON repository metadata

published Oct 21, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

meituan-longcat/R-HORIZON

Description: [ICLR'26] R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Language: Python

License: MIT

Stars: 26

Forks: 3

Open issues: 0

Created: 2025-10-21T09:40:58Z

Pushed: 2026-05-09T10:23:09Z

Default branch: main

Fork: no

Archived: no

README:

📃 Paper • 🌐 Project Page • 🤗 Dataset • 🤗 Models

R-HORIZON is a novel method designed to stimulate long-horizon reasoning behaviors in Large Reasoning Models (LRMs) through query composition. We transform isolated problems into complex multi-step reasoning scenarios, revealing that even the most advanced LRMs suffer significant performance degradation when facing interdependent problems that span long reasoning horizons.

![](./assets/mainfig.png)

🔥 Releases

[2026-03]

🤗 Models are available on Hugging Face: R-HORIZON Models

[2026-01]

🎉 R-HORIZON is Accepted to ICLR 2026!

[2025-10]

🎉 R-HORIZON Benchmark is now available! Test your LRMs on complex multi-horizon reasoning tasks.
🤗 Training and evaluation datasets are available on Hugging Face: R-HORIZON Dataset
📄 Paper released on arXiv: R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

🌟 Overview

Recent advances in reasoning-focused language models (e.g., OpenAI o1, DeepSeek-R1) have demonstrated remarkable improvements through test-time scaling and long Chain-of-Thought (CoT). However, existing benchmarks primarily focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to handle complex, long-horizon scenarios.

Key challenges in current paradigms:

- Limited evaluation scope: Existing benchmarks confine themselves to isolated problems, missing the complexity of real-world multi-step reasoning
- Limited effective reasoning length: Models struggle to maintain performance as reasoning chains grow longer
- Poor thinking budget allocation: LRMs fail to appropriately distribute thinking resources across multiple interdependent problems

To address these limitations, we introduce R-HORIZON, which:

- Transforms isolated problems into complex multi-step reasoning scenarios through query composition
- Establishes the R-HORIZON Benchmark comprising 6 representative datasets from mathematics, code generation, and agent applications
- Enables reinforcement learning with verifiable rewards (RLVR) using long-horizon reasoning data
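A minimal sketch of what query composition could look like, assuming the simplest possible scheme: several standalone problems are numbered and packed into one prompt that must be answered in order. The function name `compose_queries`, the prompt wording, and the sample problems are illustrative assumptions. The repository's actual composition templates, and the way it makes one problem depend on another, are not shown in this excerpt.

```python
from typing import List


def compose_queries(problems: List[str]) -> str:
    """Pack n standalone problems into one sequential prompt (hypothetical template)."""
    if not problems:
        raise ValueError("need at least one problem")
    header = (
        f"Solve the following {len(problems)} problems in order. "
        "Give a final answer for each one.\n\n"
    )
    body = "\n\n".join(
        f"Problem {i}: {text.strip()}" for i, text in enumerate(problems, start=1)
    )
    return header + body


# Illustrative inputs, not taken from the benchmark.
composed = compose_queries(["Compute 3 + 4.", "Compute 6 * 7."])
print(composed)
```

Plain concatenation like this only lengthens the horizon. Making the problems interdependent, as the overview describes, would need an extra step that feeds one answer into the next problem.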

![](./assets/method_fig.png)

📖 Table of Contents

- [🔥 Releases](#-releases)
- [🌟 Overview](#-overview)
- [📊 R-HORIZON Benchmark](#-r-horizon-benchmark)
- [🚀 Training with R-HORIZON](#-training-with-r-horizon)
- [Quick Start](#quick-start)
  - [Installation](#installation)
  - [Benchmark Evaluation](#benchmark-evaluation)
  - [Training with R-HORIZON datasets](#training-with-r-horizon-datasets)
- [Dataset](#dataset)
  - [Dataset Construction](#dataset-construction)
  - [Dataset on Hugging Face Hub](#dataset-on-hugging-face-hub)
  - [Dataset Structure](#dataset-structure)
- [Citation](#citation)

📊 R-HORIZON Benchmark

We evaluate 20+ state-of-the-art LRMs on the R-HORIZON Benchmark, revealing significant performance degradation as reasoning horizons increase:

![](./assets/result_fig.png)

Key findings from our benchmark evaluation:

Universal performance degradation: Even the most powerful models suffer severe drops as problem count increases. For instance, DeepSeek-R1 drops from 87.3% (single problem) to 24.6% (5 problems) on AIME25.

Model size matters: Larger models exhibit more resilience to multi-horizon challenges. R1-Qwen-7B drops from 93.6% (its single-problem MATH500 score) to 0% when solving 16 problems, showing 34.1% more degradation than the 32B models.

Task-dependent degradation: Code generation tasks show steeper performance declines compared to mathematics. Many reasoning models lose their tool-calling abilities in web search scenarios, resulting in poor multi-step performance.
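As a back-of-the-envelope reading of the first finding, assume a composed query counts as solved only when every sub-problem is answered correctly, and that sub-problems are solved independently at the single-problem rate. Both are assumptions for illustration, not the benchmark's documented scoring rule. Under them the expected accuracy on n problems is p^n, which can be compared with the reported figure:

```python
def independent_baseline(single_accuracy: float, n_problems: int) -> float:
    """Expected all-correct rate if each problem were solved independently."""
    if not 0.0 <= single_accuracy <= 1.0:
        raise ValueError("single_accuracy must be a fraction in [0, 1]")
    return single_accuracy ** n_problems


# Figures quoted above: DeepSeek-R1 on AIME25, 87.3% single, 24.6% with 5 problems.
expected = independent_baseline(0.873, 5)
reported = 0.246
gap = expected - reported
print(f"independent baseline: {expected:.3f}, reported: {reported:.3f}, gap: {gap:.3f}")
```

Under these assumptions the baseline is roughly 0.51, so the reported 24.6% sits well below what independent errors alone would predict.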

🚀 Training with R-HORIZON

Training with R-HORIZON composed data yields substantial improvements on both single and multi-horizon reasoning tasks:

![](./assets/skywork_n1_n2_comparison.png)

Training results highlights:

Dual Performance Gains: Training with 2-composed problems significantly improves both multi-horizon reasoning (+17.4 points on AIME24 n=2) and single-problem performance (+7.5 points on AIME24 original).

Scalable Complexity: Increasing composition complexity (n=4) enhances the model's ability to handle problems requiring more reasoning steps, achieving 50.6% on MATH500 (n=8).

| Models          | MATH500 (Origin) | MATH500 (n=8) | AIME24 (Origin) | AIME24 (n=2) | AIME25 (Origin) | AIME25 (n=2) | AMC23 (Origin) | AMC23 (n=2) |
|-----------------|------------------|---------------|-----------------|--------------|-----------------|--------------|----------------|-------------|
| R1-Qwen-7B      | 93.6             | 11.8          | 48.3            | 16.4         | 33.3            | 3.5          | 90.2           | 48.8        |
| Baseline (n=1)  | 95.6             | 8.4           | 57.9            | 16.7         | 47.9            | 5.1          | 95.9           | 55.0        |
| R-HORIZON (n=2) | 95.4             | 21.4          | 65.4            | 34.1         | 49.6            | 10.0         | 94.1           | 80.6        |
| R-HORIZON (n=4) | 94.6             | 50.6          | 62.9            | 34.8         | 45.4            | 8.1          | 91.9           | 79.1        |
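The gains quoted in the highlights can be reproduced from the table by subtracting the Baseline (n=1) row from a trained row. The sketch below does only that arithmetic. The helper name `gains_over_baseline` is illustrative, and the numbers are copied from the table rather than recomputed from any evaluation run.

```python
# Rows copied from the results table above (accuracy in %).
COLUMNS = [
    "MATH500 (Origin)", "MATH500 (n=8)",
    "AIME24 (Origin)", "AIME24 (n=2)",
    "AIME25 (Origin)", "AIME25 (n=2)",
    "AMC23 (Origin)", "AMC23 (n=2)",
]
ROWS = {
    "Baseline (n=1)": [95.6, 8.4, 57.9, 16.7, 47.9, 5.1, 95.9, 55.0],
    "R-HORIZON (n=2)": [95.4, 21.4, 65.4, 34.1, 49.6, 10.0, 94.1, 80.6],
    "R-HORIZON (n=4)": [94.6, 50.6, 62.9, 34.8, 45.4, 8.1, 91.9, 79.1],
}


def gains_over_baseline(model: str) -> dict:
    """Per-column difference (in points) between a model row and the baseline row."""
    baseline = ROWS["Baseline (n=1)"]
    return {
        col: round(score - base, 1)
        for col, score, base in zip(COLUMNS, ROWS[model], baseline)
    }


gains_n2 = gains_over_baseline("R-HORIZON (n=2)")
print(gains_n2["AIME24 (n=2)"], gains_n2["AIME24 (Origin)"])
```

This yields +17.4 and +7.5 points on AIME24, matching the highlights. The same subtraction also shows that the n=2 model is slightly below the baseline on two single-problem columns (MATH500 Origin by 0.2 and AMC23 Origin by 1.8).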

Quick Start

Installation

# Clone the repository
git clone https://github.com/meituan-longcat/R-HORIZON.git
cd R-HORIZON

# Create conda environment
conda create -n r-horizon python=3.10 -y
conda activate r-horizon

# Install PyTorch
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation

# Install additional dependencies
pip install -r requirements.txt
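After installation, a quick sanity check can confirm which versions actually landed in the environment. This is a hedged sketch using only the standard library. The helper names are illustrative, and the distribution names `torch` and `flash-attn` are assumed to match what the pip commands above install.

```python
from importlib import metadata
import sys


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> dict:
    """Collect the versions that the installation steps above are meant to pin."""
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "torch": installed_version("torch"),
        "flash-attn": installed_version("flash-attn"),
    }


report = environment_report()
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Reading package metadata instead of importing `torch` keeps the check fast and lets it run even when the install is incomplete.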

Benchmark Evaluation

1. Download the R-HORIZON Benchmark

# Download benchmark datasets
python ./evaluation/data/download.py

2. Modify config.json under the evaluation directory

{
  "inference": {
    // model_key (e.g. r1-distill-qwen7b) is for run.sh
    "r1-distill-qwen7b": {
      // the ip and port used in vllm server
      "base_url": "http://{Your IP and Port}/v1/completions",
      "api_key": "EMPTY",
      // model_name is corresponding to the modelname in vllm server
      "model_name": "{vllm's modelname}",
      "params": {
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 10,
        "max_tokens": 65536
      },
      "prompt_prefix": "user:\n",
      "prompt_suffix":...

Excerpt shown — open the source for the full document.
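To make the config fields concrete, the sketch below shows one way an entry like the one above could be turned into a request body for an OpenAI-compatible `/v1/completions` endpoint. It is an assumption about usage, not the repository's evaluation code, which is not part of this excerpt. The helper `build_completion_payload` is hypothetical, and because the `prompt_suffix` value is truncated in the source, it is left as a parameter with an empty default rather than guessed.

```python
import json


def build_completion_payload(entry: dict, question: str, prompt_suffix: str = "") -> dict:
    """Turn one config entry plus a question into a /v1/completions request body."""
    prompt = entry.get("prompt_prefix", "") + question + prompt_suffix
    payload = {"model": entry["model_name"], "prompt": prompt}
    payload.update(entry.get("params", {}))  # temperature, top_p, top_k, max_tokens
    return payload


# Mirrors the fields visible in the excerpt; placeholder values are left unresolved.
entry = {
    "base_url": "http://{Your IP and Port}/v1/completions",
    "api_key": "EMPTY",
    "model_name": "{vllm's modelname}",
    "params": {"temperature": 1.0, "top_p": 0.95, "top_k": 10, "max_tokens": 65536},
    "prompt_prefix": "user:\n",
}

payload = build_completion_payload(entry, "Compute 3 + 4.")
print(json.dumps(payload, indent=2))
```

The payload is only built and printed here, not sent. Note that `top_k` is not part of the standard OpenAI completions schema; vLLM's OpenAI-compatible server accepts it as an extra sampling parameter.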

Notability

notability 3.0/10

New repo with 26 stars, routine