Fork • Novita AI • published Jan 20, 2026

novitalabs/R2E-Gym

forked from R2E-Gym/R2E-Gym


Description: [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents

Language: Python

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2026-01-20T10:28:49Z

Pushed: 2026-01-20T12:31:20Z

Default branch: main

Fork: yes

Parent repository: R2E-Gym/R2E-Gym

Archived: no

README: R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents

Naman Jain*¹, Jaskirat Singh*², Manish Shetty¹, Liang Zheng², Koushik Sen¹, Ion Stoica¹

¹UC Berkeley, ²ANU. *Equal contribution, ^Equal supervision

📃 Paper • 🤗 Data & Models • 🌐 Project Page

---

🚨 UPDATES

🔥 NEW: DeepSWE Models Available! We've released DeepSWE, our latest state-of-the-art SWE agent models. They are trained with **rLLM** and achieve exceptional performance on SWE-Bench.

  • 🤗 Model: agentica-org/DeepSWE-Preview
  • 📋 Reproduction Guides: Check out our detailed reproduction guides in the [reproduction/](./reproduction/) folder:
    • [DEEPSWE_REPRODUCTION.MD](./reproduction/DEEPSWE_REPRODUCTION.MD) - Complete guide for reproducing DeepSWE results
    • [DEEPSWE_TTS_REPRODUCTION.MD](./reproduction/DEEPSWE_TTS_REPRODUCTION.MD) - Test-time scaling reproduction guide

---

We present R2E-Gym, the largest procedurally curated environment for training real-world SWE-Agents. We show that R2E-Gym enables more scalable training and test-time scaling, achieving 51% on the SWE-Bench Verified benchmark. This is a new state-of-the-art for open-weight SWE-Agents and, for the first time, is competitive with proprietary models such as o1 and sonnet-3.5-v2 with tools.

![!R2E-Gym Environment](./assets/docs-teaser-v1.png)

R2E-Gym is powered by two main contributions: (a) SWE-GEN: a synthetic data curation recipe for curating executable training environments without relying on human-written tests and issues. (b) Hybrid Inference Time Scaling: we show that while both execution-based and execution-free verifiers elicit inference-time gains, significantly better performance can be achieved by leveraging the strengths of both. Overall, the final approach achieves SOTA performance for open-weight SWE-Agents, while also being competitive with some proprietary model baselines.

---

> While LLM-based SWE-Agents have demonstrated remarkable improvements, state-of-the-art performance is largely driven by proprietary models, with open models lagging behind. Closing this performance gap requires addressing two core challenges: First, we need scalable methods to curate diverse, high-quality execution environments for training. Second, we need efficient strategies for scaling test-time compute. R2E-Gym presents a joint framework for addressing both of these challenges.

R2E-Gym Environment

We create R2E-Gym, the largest procedurally curated gym environment for training real-world SWE-Agents, consisting of more than 8.1K problems across 13 repos, with executable gym environments, unit tests, and natural-language task descriptions.

Synthetic Data Enables Scalable Agent Training

R2E-Gym is powered by SWE-GEN — a novel synthetic data curation recipe that enables collection of a large number of executable training environments without reliance on human-written pull requests (PRs) or unit tests. We show that instead of using human-written PRs, good-quality execution environments can directly be curated from commits. Compared to PR-based data collection, we find that this approach enables more scalable data curation and agent-training, resulting in a SOTA pass@1 performance of 34.4% on the challenging SWE-Bench Verified benchmark.

Hybrid Test-time Scaling

Finally, we introduce Hybrid Test-time Scaling, a novel paradigm for scaling test-time compute. We show that while both execution-based and execution-free verifiers elicit inference-time gains, they exhibit complementary strengths and weaknesses. Leveraging the strengths of each approach allows significantly better performance when scaling test-time compute, resulting in a 51% pass@1 performance on the SWE-Bench Verified benchmark, a new state-of-the-art for open-weight SWE-Agents.
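As a rough illustration of how two kinds of verifier scores might be combined when choosing among several candidate patches, here is a minimal sketch. It is not the repository's implementation: the `Candidate` class, its score fields, and the "shortlist by execution-free score, then pick by execution-based score" rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate patch with two hypothetical verifier scores."""
    patch_id: str
    exec_free_score: float   # assumed: score from a verifier that reads the trajectory
    exec_based_score: float  # assumed: fraction of generated tests the patch passes


def select_hybrid(candidates: list[Candidate], top_n: int) -> Candidate:
    """Shortlist the top_n candidates by execution-free score, then return the
    shortlisted candidate with the best execution-based score. Ties on the
    execution-based score are broken by the execution-free score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if top_n < 1:
        raise ValueError("top_n must be >= 1")
    shortlist = sorted(
        candidates, key=lambda c: c.exec_free_score, reverse=True
    )[:top_n]
    return max(shortlist, key=lambda c: (c.exec_based_score, c.exec_free_score))


# Illustrative scores only; these are not measured results.
candidates = [
    Candidate("a", exec_free_score=0.9, exec_based_score=0.5),
    Candidate("b", exec_free_score=0.8, exec_based_score=1.0),
    Candidate("c", exec_free_score=0.2, exec_based_score=1.0),
]
best = select_hybrid(candidates, top_n=2)
print(best.patch_id)  # prints "b"
```

In this toy example, candidate "c" passes all tests but is filtered out by the execution-free shortlist, while "a" ranks highest on the execution-free score but loses to "b" on test results. That is the kind of complementarity the paragraph above describes.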

---

🔧 Setup

> [!IMPORTANT]
> Installation is required!

# install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# activate venv
uv venv
source .venv/bin/activate
uv sync && uv pip install -e .
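After installation, a quick sanity check can confirm that the packages imported in the Quickstart are visible to the active virtual environment. The snippet below is a minimal sketch that uses only the Python standard library; the helper name `missing_packages` is hypothetical and not part of R2E-Gym.

```python
import importlib.util


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "r2egym" and "datasets" are the top-level import names used in the Quickstart.
missing = missing_packages(["r2egym", "datasets"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All Quickstart imports are available.")
```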

🚀 Quickstart

  • Usage: The R2E-Gym environment can be used as follows:
from r2egym.agenthub.environment.env import EnvArgs, RepoEnv
from r2egym.agenthub.agent.agent import AgentArgs, Agent
from pathlib import Path
from datasets import load_dataset

# load gym dataset [R2E-Gym/R2E-Gym-Subset, R2E-Gym/R2E-Gym-Full, R2E-Gym/R2E-Gym-Lite, R2E-Gym/SWE-Bench-Verified, R2E-Gym/SWE-Bench-Lite]
ds = load_dataset("R2E-Gym/R2E-Gym-Lite")
split = 'train' # split of the dataset [train, test]

# load gym environment
env_index = 100 # index of the environment, in [0, len(ds[split]))
env_args = EnvArgs(ds = ds[split][env_index])
env = RepoEnv(env_args)

# load agent
agent_args = AgentArgs.from_yaml(Path('./src/r2egym/agenthub/config/edit_fn_calling.yaml'))
# define llm: ['claude-3-5-sonnet-20241022', 'gpt-4o', 'vllm/R2E-Gym/R2EGym-32B-Agent']
agent_args.llm_name = 'claude-3-5-sonnet-20241022'
agent = Agent(name="EditingAgent", args=agent_args)

# run the agent (note: set use_fn_calling=False for R2E-Gym agents)
output = agent.run(env, max_steps=40, use_fn_calling=True)

> [!NOTE]
> The output of the agent is a Trajectory object, which contains detailed stats including the full agent trajectory, problem statement, max execution time, exit reason, and output patch. Please refer to src/r2egym/agenthub/agent/agent.py and src/r2egym/agenthub/trajectory/trajectory.py for more details.

  • Reward Calculation: All R2E-Gym environments support automated reward calculation using unit tests.
# calculate reward
out = env.runtime._calculate_reward()
  • Gym Environment Stats: The detailed stats for each environment (including natural language task description, repo name, and ground truth patch) can be accessed as follows:
# get the environment stats
env_stats_dict = env.get_stats()
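Rewards collected from several environments can be aggregated into a single resolve rate. The sketch below is self-contained and does not call R2E-Gym. It assumes each reward is a number where 1.0 means the unit tests passed; this README does not confirm that convention. The helper name `resolve_rate` is hypothetical.

```python
def resolve_rate(rewards: list[float], threshold: float = 1.0) -> float:
    """Fraction of environments whose reward reaches the threshold."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    solved = sum(1 for r in rewards if r >= threshold)
    return solved / len(rewards)


# Hypothetical rewards from four separate runs (assumed: 1.0 = tests passed).
rewards = [1.0, 0.0, 1.0, 0.0]
print(f"resolve rate: {resolve_rate(rewards):.1%}")  # prints "resolve rate: 50.0%"
```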

> [!TIP]
> R2E-Gym environments also offer a range of other convenient functions, such as apply_patch,…
