Together AI, published Feb 25, 2026

CoderForge-Preview: SOTA open dataset for training efficient coding agents




Published 2/25/2026


Authors

By Alpay Ariyak*, Junda Zhang, Junxiong Wang, Shang Zhu, Federico Bianchi, Sanjana Srivastava, Ashwinee Panda, Siddhant Bharti, Chenfeng Xu, John Heo, Xiaoxia Shirley Wu, James Zou, Percy Liang, Leon Song, Ce Zhang, Ben Athiwaratkun, Zhongzhu Zhou*, Qingyang Wu*

*Project Core Leads


We release CoderForge-Preview, the largest open test-verified coding agent dataset. Fine-tuning Qwen-3 32B on it raises SWE-Bench Verified performance by 23.0% over the base model to 59.4%, ranking #1 among open-data models in the ≤32B parameter range.

As coding agents become more capable, the research community faces a critical bottleneck: the lack of large-scale, high-quality open training data. While proprietary models continue to advance, open-weight alternatives have been held back by limited access to the long-context, test-verified trajectories needed for effective agent training.

We're releasing CoderForge-Preview, the largest open dataset of coding agent trajectories to date: 258K test-verified trajectories (155K pass | 103K fail) spanning 51K tasks across 1,655 repositories. We also share results from using it to train 32B and 4B models. By releasing CoderForge openly, we aim to accelerate progress across the open-source AI community and enable researchers everywhere to build on, study, and improve our work.

Fine-tuning Qwen-3 32B achieves 59.4% pass@1 on SWE-Bench Verified [8], ranking #1 among open-data models in the ≤32B parameter range. We release the full trajectory dataset, as well as the evaluation trajectories for the 32B model.

CoderForge-Preview Data

We generate agent trajectories from three task sources using Qwen3-Coder-480B and apply rejection sampling, discarding solutions that fail the tests. Generation yields 258K long-context trajectories (up to 128K tokens) across 51K tasks, of which we retain 155K high-quality, test-verified trajectories for SFT training.

Task sources

We draw tasks from three sources: R2E-Gym [5], SWE-Smith [6], and SWE-Rebench [7]:

Table 1: Task counts reflect instances valid in our execution environment.

| Source | Tasks | Unique Repos |
| --- | --- | --- |
| R2E-Gym | 4,216 | 9 |
| SWE-Smith | 37,221 | 124 |
| SWE-Rebench | 9,764 | 1,577 |
| Total | 51,201 | 1,655 |

Setup

As the agent scaffold, we integrate OpenHands v0.52.1 [4] into the R2E-Gym [5] data generation framework. It includes four main tools: bash execution (execute_bash), file editing (str_replace_editor), thought logging (think), and task completion (finish). OpenHands is pre-installed in each Docker evaluation environment, enabling LLM agents to interact with isolated code repositories through a standardized action/observation interface. Each task is executed within an isolated Docker container, where the agent iteratively issues bash commands and file edits for up to 100 steps to generate a final patch.

We use Qwen3-Coder-480B as the main model for data generation, with a temperature of 0.7, a top_p of 0.8, and 32,768 max new tokens. To increase the number of successful trajectories, we generate multiple trajectories per problem: 8 for R2E-Gym and SWE-Rebench, and 4 for SWE-Smith. We keep only trajectories whose final patches pass all repository tests. To avoid evaluation leakage, we exclude any task that shares a (repository, base_commit) pair or a problem statement with a SWE-Bench Verified sample.

Comparison to other datasets

Table 2: Comparison of CoderForge-Preview with other open coding agent datasets.

| Dataset | Teacher | Context Length (tokens) | Size (Total) | Size (Filtered) |
| --- | --- | --- | --- | --- |
| R2E-Gym/R2EGym-SFT-Trajectories | Claude Sonnet 3.5 | 20000 | 3,231 | 3,231 |
| SWE-bench/SWE-smith-trajectories | Claude Sonnet 3.7 | 32768 | 49,897 | 21,513 |
| allenai/SERA [10] | GLM-4.6 | 32768 | 25,224 | 25,224 |
| nex-agi/agent-sft (agentic_code) [11] | DeepSeek-V3.1-Nex-N1 | 128000 | 24,796 | 24,796 |
| nebius/SWE-rebench-openhands-trajectories [12] | Qwen3-Coder-480B | 128000 | 67,074 | 32,161 |
| CoderForge-Preview Data | Qwen3-Coder-480B | 128000 | 258,134 | 155,144 |

CoderForge-Preview Data stands out as the largest and best-performing coding-agent trajectory dataset among comparable releases. With 258,134 total and 155,144 successful trajectories at a 128K context length, it substantially exceeds prior datasets in both scale and long-context coverage.

Trajectory success by task source

For each trajectory, we run the relevant tests provided with its task to check whether the model has solved it. R2E-Gym tasks had consistently the highest solve rate, rising from 62.9% at Pass@1 to 80.3% at Pass@8. SWE-Rebench also benefits substantially from multi-attempt sampling, improving from 57.5% to 73.9% by Pass@8. SWE-Smith shows more modest gains, increasing from 58.8% at Pass@1 to 64.9% at Pass@4. Overall, the trend highlights the effectiveness of multi-sample generation in increasing the yield of successful trajectories, with diminishing but consistent returns as the number of attempts grows.

Final Task Source Distribution

We filter our generated trajectories by whether they solved the task, resulting in the distribution shown in the table below. For our SFT experiments, we trained only on the successful trajectories.
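The two filtering steps described in this post (excluding tasks that overlap with SWE-Bench Verified, then keeping only test-passing trajectories) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Trajectory` class, its field names, and the use of exact string matching on problem statements are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    # Hypothetical record layout; the released dataset may use other fields.
    trajectory_id: str
    repository: str
    base_commit: str
    problem_statement: str
    reward: float  # 1.0 if the final patch passed all tests, else 0.0


def decontaminate(trajectories, benchmark_tasks):
    """Drop trajectories whose task overlaps with a benchmark task.

    A task overlaps if it shares a (repository, base_commit) pair or an
    identical problem statement (exact match is an assumption here).
    """
    seen_commits = {(t["repository"], t["base_commit"]) for t in benchmark_tasks}
    seen_statements = {t["problem_statement"].strip() for t in benchmark_tasks}
    return [
        traj
        for traj in trajectories
        if (traj.repository, traj.base_commit) not in seen_commits
        and traj.problem_statement.strip() not in seen_statements
    ]


def keep_successful(trajectories):
    """Rejection sampling: keep only trajectories with reward = 1.0."""
    return [traj for traj in trajectories if traj.reward == 1.0]


# Toy data, invented for illustration only.
benchmark = [
    {"repository": "org/a", "base_commit": "abc123",
     "problem_statement": "Fix crash in parser"},
]
candidates = [
    Trajectory("a", "org/a", "abc123", "Different text", 1.0),       # same (repo, commit)
    Trajectory("b", "org/b", "def456", "Fix crash in parser", 1.0),  # same statement
    Trajectory("c", "org/c", "777aaa", "Add retry logic", 0.0),      # tests failed
    Trajectory("d", "org/c", "777aaa", "Add retry logic", 1.0),      # kept
]

sft_set = keep_successful(decontaminate(candidates, benchmark))
print([t.trajectory_id for t in sft_set])  # → ['d']
```

Keeping the two filters as separate functions makes it easy to retain the failing trajectories as well, which matters here because the release includes both passing and failing trajectories.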

| Task Source | Trajectories Generated | Trajectories Generated (reward = 1.0) |
| --- | --- | --- |
| R2E-Gym | 32,964 | 20,904 |
| SWE-Smith | 148,001 | 89,501 |
| SWE-Rebench | 77,169 | 44,739 |
| Total | 258,134 | 155,144 |
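The table above counts trajectories, while the Pass@k figures count tasks, so the two are related but not identical. The sketch below shows one common way to compute Pass@k from per-task success counts (the unbiased combinatorial estimator) alongside the per-source trajectory yield implied by the table. The post does not state which Pass@k estimator was used, so the estimator choice and the toy success counts are assumptions; only the `counts` dictionary is taken from the table.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task with n attempts, c of them successful."""
    if k > n:
        raise ValueError("k cannot exceed the number of attempts n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n attempts contains no success; comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(successes_per_task, n: int, k: int) -> float:
    """Average pass@k over tasks that each have n attempts."""
    return sum(pass_at_k(n, c, k) for c in successes_per_task) / len(successes_per_task)


# Toy example (invented): four tasks, 8 attempts each, with these success counts.
toy_successes = [0, 1, 4, 8]
print(mean_pass_at_k(toy_successes, n=8, k=1))  # → 0.40625
print(mean_pass_at_k(toy_successes, n=8, k=8))  # → 0.75

# Trajectory-level yield from the table: (generated, reward = 1.0).
counts = {
    "R2E-Gym": (32_964, 20_904),
    "SWE-Smith": (148_001, 89_501),
    "SWE-Rebench": (77_169, 44_739),
}
yields = {source: passed / generated for source, (generated, passed) in counts.items()}
for source, value in yields.items():
    print(f"{source}: {value:.1%}")
```

The yields come out at roughly 63.4%, 60.5%, and 58.0%, close to the reported Pass@1 figures. That is expected, since Pass@1 is essentially the per-attempt success rate, whereas Pass@8 and Pass@4 credit a task as soon as any one of its attempts succeeds.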

Trajectory Characteristics

Data Generation Cost

Table 3: Data generation cost and efficiency metrics across task sources.

| Source | Completions API | Prompt Tokens | Output Tokens | Avg Output Tokens | Cache Hit Rate |
| --- | --- | --- | --- | --- | --- |
| R2E-Gym | 2.18M | 59B | 404M | 185.4 | 96.64% |
| SWE-Smith | 8.47M | 238B | 1,544M | 182.4 | 90.15% |

SWE-Rebench…
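As a quick consistency check on the visible rows of Table 3, the Avg Output Tokens column appears to equal output tokens divided by the Completions API column. That reading is an inference from the numbers, not something the post states, and the inputs below are the rounded values shown in the table, so small differences from the reported averages are expected.

```python
def avg_output_tokens(output_tokens: float, completions: float) -> float:
    """Average output tokens per completion (assumed definition)."""
    return output_tokens / completions


# Rounded values copied from the two rows visible in Table 3.
rows = {
    "R2E-Gym": {"completions": 2.18e6, "output_tokens": 404e6, "reported_avg": 185.4},
    "SWE-Smith": {"completions": 8.47e6, "output_tokens": 1544e6, "reported_avg": 182.4},
}

estimates = {
    source: avg_output_tokens(r["output_tokens"], r["completions"])
    for source, r in rows.items()
}
for source, estimate in estimates.items():
    reported = rows[source]["reported_avg"]
    print(f"{source}: computed {estimate:.1f}, reported {reported}")
```

Both computed values land within about 0.1 of the reported averages, which is consistent with rounding in the table.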
