AReaL: Ant Reasoning Reinforcement Learning for LLMs
AReaL (Ant Reasoning RL) is an open-source, fully asynchronous reinforcement learning training system for large reasoning models, developed at the RL Lab, Ant Research. Built upon the open-source project RealHF, we are fully committed to open source: we provide the training details, data, and infrastructure required to reproduce our results, along with the models themselves. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it's delicious, customizable, and affordable. We hope you enjoy our project just as you enjoy real-world milk tea (cheers).
AReaL Highlights
🔥 [NEW] Asynchronous RL: With algorithm-system co-design, AReaL supports fully asynchronous RL for the fastest training! Experimental support for multi-turn agentic RL is also provided.
🛠️ Open & Reproducible: We continuously release all code, datasets, and training recipes for RL training of LLMs.
🚀 Scalability: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
🔪 Cutting-Edge Performance: AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.
News
[2025/06/03] (v0.3, boba²) We release boba² (double-boba) for fully asynchronous RL training, which achieves a 2.77x speedup while obtaining on-par or even better training performance compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out our v0.3 overview blog and the research paper.
[2025/03/31] (v0.2, boba) Here comes our next milestone release: boba! Please call it A-ReaL-boba! This release includes much faster training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our v0.2 technical blog.
[2025/02/24] (v0.1) Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our v0.1 technical blog.
Release Highlights
In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the top 3 most important features:
A fully asynchronous RL training pipeline with system and RL algorithm co-design, achieving over 2.77x speedup without any performance drop. Check the benchmark scripts and instructions here.
SOTA coding models, i.e., a 14B model with a 69.1 score on LCB-v5. To reproduce, check the configs and instructions.
Experimental support for multi-turn agentic RL training. Check our complete example.
For the complete system design and more training details, please check our v0.3 blog and our research paper.
Jump to the quickstart section if you want to quickly run an experiment and get your hands dirty! 😈
Overview of Asynchronous RL Training
During the synchronous RL training process, a generation step must wait until the longest sequence completes within the batch of LLM outputs. Due to the varying output lengths for LRMs, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works ( DeepCoder , Intellect ) propose overlapping a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.
Fig.1. Left: Execution timeline of synchronous RL training. Right: Execution timeline of one-step overlap RL system.
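To make the idle-time argument concrete, here is a minimal sketch that estimates how much decoding capacity is wasted when every sequence in a batch waits for the longest one. The function name `sync_idle_fraction` and the long-tailed length distribution are assumptions chosen for illustration; they are not taken from AReaL or from any measured workload.

```python
import random


def sync_idle_fraction(lengths):
    """Fraction of decode slots left idle when the whole batch
    waits for its longest sequence to finish."""
    longest = max(lengths)
    used = sum(lengths)                # slots that produced a token
    total = longest * len(lengths)     # slots reserved until the batch ends
    return 1.0 - used / total


random.seed(0)
# Hypothetical long-tailed output lengths (in tokens) for a batch of 64 prompts.
lengths = [int(random.paretovariate(1.5) * 1000) for _ in range(64)]
print(f"idle fraction in this batch: {sync_idle_fraction(lengths):.1%}")
```

The more skewed the length distribution, the closer the idle fraction gets to 1, which is why variable-length reasoning outputs hurt synchronous systems in particular.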
AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.
Fig.2. Execution timeline of our fully asynchronous RL system.
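The decoupling described above can be sketched with a producer-consumer loop. This is a toy model under stated assumptions, not AReaL's implementation: `rollout_worker`, `trainer`, the shared `state` dictionary, and the sleep calls standing in for decoding are all hypothetical. It shows only the control flow, in which workers keep generating while the trainer consumes batches and advances the policy version.

```python
import asyncio
import random


async def rollout_worker(worker_id, queue, state):
    # Generate continuously; never wait for other workers or for the trainer.
    while not state["done"]:
        version = state["version"]  # policy version when generation starts
        await asyncio.sleep(random.uniform(0.001, 0.01))  # stand-in for decoding
        await queue.put({"worker": worker_id, "version": version})


async def trainer(queue, state, batch_size, num_steps):
    history = []
    for _ in range(num_steps):
        # Train as soon as a full batch has streamed in.
        batch = [await queue.get() for _ in range(batch_size)]
        history.append([sample["version"] for sample in batch])
        state["version"] += 1  # stand-in for a gradient update and weight sync
    state["done"] = True
    return history


async def main(num_workers=4, batch_size=8, num_steps=5):
    queue = asyncio.Queue()
    state = {"version": 0, "done": False}
    workers = [
        asyncio.create_task(rollout_worker(i, queue, state))
        for i in range(num_workers)
    ]
    history = await trainer(queue, state, batch_size, num_steps)
    await asyncio.gather(*workers)
    return history


history = asyncio.run(main())
for step, versions in enumerate(history):
    print(f"step {step}: sample versions {sorted(set(versions))}")
```

In this sketch a batch can mix samples produced by different policy versions, which is the situation that staleness control and the modified PPO objective are meant to handle.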
AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.
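One way to picture staleness control is an admission gate in front of the rollout workers. The sketch below is an assumption for illustration, not AReaL's rollout controller: the class `StalenessGate`, its method names, and the specific rule (a new rollout is admitted only if the training step that will consume it is at most `max_staleness` versions ahead of the current policy) are hypothetical.

```python
class StalenessGate:
    """Hypothetical gate that bounds how far generation may run ahead of training."""

    def __init__(self, batch_size, max_staleness):
        self.batch_size = batch_size
        self.max_staleness = max_staleness
        self.submitted = 0        # rollouts admitted so far
        self.trainer_version = 0  # number of completed training steps

    def try_submit(self):
        # Index of the training step that would consume the next rollout.
        target_step = self.submitted // self.batch_size
        if target_step > self.trainer_version + self.max_staleness:
            return False  # too far ahead: the sample would be too stale when used
        self.submitted += 1
        return True

    def on_train_step(self):
        # Called after each model update; frees capacity for new rollouts.
        self.trainer_version += 1


gate = StalenessGate(batch_size=2, max_staleness=1)
admitted = sum(gate.try_submit() for _ in range(10))
print(f"admitted before any training step: {admitted}")
```

Under this rule, a sample consumed at step k was admitted when the policy version was at least k minus `max_staleness`, so the version gap at training time stays bounded.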
We compare the scalability of asynchronous RL training based on our AReaL-boba² system with classical synchronous RL training (we adopt the fastest open-source system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL scales much better in training throughput. This is partly because AReaL decouples training from generation, which leads to far less GPU memory fragmentation.
Fig.3. The scaling trend of asynchronous RL (based on AReaL-boba²) and classical synchronous RL (based on veRL) with different model sizes. Dotted lines indicate ideal linear scaling.
SOTA Code Generation Model by AReaL-boba²
We use Qwen3 as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforces, and CodeContests benchmarks.
| Model (8B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
| DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
| 🤗 AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | 41.4 |
| 🤗 AReaL-boba²-8B | 63.0 | 1962/97.5% | 40.8 |

| Model (14B) | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
| DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
| 🤗 AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | 46.2 |
| 🤗 AReaL-boba²-14B | 69.1 | 2044/98.2% | 46.1 |

| Larger Models | LiveCodeBench v5 (2024.10-2025.2) | Codeforces | CodeContests |
| --- | --- | --- | --- |
| Qwen3-235B | 70.7 | 2056 | - |
| DeepSeek-R1 | 64.3 | 2029 | - |
| OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |
Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-source data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforces & CodeContests.
We highlight the tutorials and code walkthroughs about the following key features for asynchronous training:
Streaming generation and reward computation
Interruptible rollout
Data staleness control with the rollout controller
The adoption of decoupled PPO loss
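As a rough illustration of what a decoupled PPO loss can look like, the sketch below separates the behavior policy (which generated the possibly stale sample) from a proximal policy (which anchors the clipping range). This form is an assumption for illustration and may differ from the loss AReaL actually uses; the function name, its arguments, and the default `eps` are hypothetical, and the inputs are per-token log-probabilities and advantages given as plain Python lists.

```python
import math


def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages, eps=0.2):
    """Clipped surrogate loss with separate proximal and behavior policies."""
    total = 0.0
    for ln, lp, lb, adv in zip(logp_new, logp_prox, logp_behav, advantages):
        ratio = math.exp(ln - lp)         # new policy vs. proximal policy
        behav_weight = math.exp(lp - lb)  # corrects for the stale behavior policy
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += behav_weight * min(ratio * adv, clipped * adv)
    return -total / len(advantages)       # negate: optimizers minimize


# When the proximal and behavior policies coincide, this reduces to standard PPO.
loss = decoupled_ppo_loss(
    logp_new=[math.log(2.0)], logp_prox=[0.0], logp_behav=[0.0], advantages=[1.0]
)
print(loss)
```

The design point is that the clipping range is measured against a recent policy rather than against the policy that produced the data, so stale samples do not shrink the effective trust region.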
RL Training for Multi-turn Agent
AReaL-boba² allows you to independently customize the dataset, rollout behavior, and the training algorithm, without needing to modify the heavy system-level code.
In particular, we show a…