basetenlabs/prime-rl
forked from PrimeIntellect-ai/prime-rl
Description: Async RL Training at Scale
Language: Python
License: Apache-2.0
Stars: 1
Forks: 0
Open issues: 18
Created: 2026-02-15T01:09:10Z
Pushed: 2026-06-04T21:10:19Z
Default branch: main
Fork: yes
Parent repository: PrimeIntellect-ai/prime-rl
Archived: no
README:
---
PRIME-RL: Async RL Training at Scale
---
Overview
PRIME-RL is a framework for large-scale asynchronous reinforcement learning. It is designed to be easy to use and hackable, yet capable of scaling to 1000+ GPUs. Beyond that, here is why we think you might like it:
1. Integrates natively with `verifiers` environments via the Environments Hub
2. Supports end-to-end post-training, including SFT and RL training and evals
3. Multi-node deployment with FSDP2 training and vLLM inference backend
4. Designed for asynchronous agentic RL training at scale
5. Hackable, modular and extensible by nature
Setup
> *We develop and test on NVIDIA RTX 3090/4090/5090, A100, H100, H200, and B200. If your setup fails, please create an issue.*
Prerequisites
Currently, you need at least one NVIDIA GPU to use PRIME-RL. If you don't already have access to one, we recommend our compute platform for everything from renting on-demand single GPUs for developing, debugging and small ablations, to reserving 1000+ GPU clusters for production-scale training.
Quick Setup
Set up PRIME-RL in a single command.
curl -sSL https://raw.githubusercontent.com/PrimeIntellect-ai/prime-rl/main/scripts/install.sh | bash
Manual Setup
1. Clone the repository
git clone https://github.com/PrimeIntellect-ai/prime-rl.git
cd prime-rl
2. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
3. Install dependencies from the lock file
uv sync --all-extras
3.1. Optional: Install Flash Attention 3 (on Hopper GPUs only, for the `flash_attention_3` attention backend)
> *NOTE*: This step takes a while because Flash Attention 3 has no prebuilt wheels and is built from source.
> *NOTE*: After this step, do not run `uv sync --all-extras` or plain `uv run`, as either will uninstall the package. Use `uv sync --inexact` or `uv run --no-sync` instead.
uv pip install "flash-attn-3 @ git+https://github.com/Dao-AILab/flash-attention.git@main#subdirectory=hopper" --no-build-isolation
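Because step 3.1 applies to Hopper GPUs only, it can help to confirm the GPU generation before starting the long source build. The following is a minimal sketch, not part of PRIME-RL: it assumes PyTorch is importable in the environment and relies on NVIDIA reporting compute capability 9.x for Hopper parts such as the H100 and H200. The helper names `is_hopper` and `detect_capability` are hypothetical.

```python
def is_hopper(capability):
    """Return True for compute capability 9.x, which NVIDIA uses for Hopper."""
    major, _minor = capability
    return major == 9


def detect_capability():
    """Return the (major, minor) capability of the current CUDA device, or None."""
    try:
        import torch  # assumption: installed into the environment by `uv sync`
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    capability = detect_capability()
    if capability is None:
        print("No CUDA device (or no PyTorch) detected.")
    elif is_hopper(capability):
        print(f"Hopper GPU detected (compute capability {capability}).")
    else:
        print(f"Compute capability {capability} is not Hopper.")
```

Keeping the capability check in a pure function separates it from the PyTorch query, so the logic can be exercised on a machine without a GPU.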
Validate your environment setup
1. Check that the environment uses Python 3.12
uv run python -V
2. Check that flash-attn is installed
uv run python -c "import flash_attn"
3. Check that you can run the SFT trainer (*this requires 1 GPU*)
uv run sft @ configs/debug/sft/train.toml
4. Check that you can run the RL trainer (*this requires 1 GPU*)
uv run trainer @ configs/debug/rl/train.toml
5. Check that you can run the inference server (*this requires 1 GPU*)
uv run inference @ configs/debug/infer.toml
*Keep the inference server running in the background for the next steps.*
5.1. Check that you can run the orchestrator against the inference server
uv run orchestrator @ configs/debug/orch.toml
5.2. Check that you can run evals against the inference server
uv run eval @ configs/debug/eval.toml
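Validation steps 1 and 2 above can also be combined into one script, which is convenient on a freshly provisioned node. This is a hedged sketch rather than a tool shipped with PRIME-RL: the function names are hypothetical, and it only checks that the `flash_attn` module can be located, not that it works on your GPU.

```python
import importlib.util
import sys


def python_matches(expected=(3, 12), version_info=None):
    """Return True when the interpreter's major.minor version equals `expected`."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) == tuple(expected)


def module_available(name):
    """Return True if the top-level module `name` can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def preflight():
    """Run both checks and return a mapping of check name to pass/fail."""
    return {
        "python_3_12": python_matches(),
        "flash_attn": module_available("flash_attn"),
    }


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{check}: {'ok' if ok else 'FAILED'}")
```

Saved under a name of your choice, the script runs the same way as the other checks, for example with `uv run python` followed by the file name.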
Additional Setup
1. If you want to log your runs to W&B, log in
uv run wandb login # Or set `export WANDB_API_KEY=...`
2. If you require gated/private models or datasets from Hugging Face, log in
uv run hf auth login # Or set `export HF_TOKEN=...`
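If you rely on the environment variables rather than the interactive logins (for example on a cluster node), a small check before launching a long run can catch a missing variable early. The sketch below is an illustration only and `missing_credentials` is a hypothetical helper; the variable names are the ones mentioned in the two steps above. An interactive login may store credentials on disk instead, so a missing variable does not necessarily mean you are logged out.

```python
import os

# Variable names taken from the setup steps above, with what each one is used for.
CREDENTIALS = {
    "WANDB_API_KEY": "Weights & Biases logging",
    "HF_TOKEN": "gated/private Hugging Face models and datasets",
}


def missing_credentials(environ=None):
    """Return the credential variables that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in CREDENTIALS if not environ.get(name)]


if __name__ == "__main__":
    for name in missing_credentials():
        print(f"{name} is not set (needed for {CREDENTIALS[name]})")
```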
Training Examples
We provide end-to-end training examples in the [examples](examples) directory to highlight features of the framework and guide you through the process of training your own models.
1. [Reverse Text](examples/reverse_text/README.md): Train Qwen3-0.6B to reverse a small chunk of text. Demonstrates tiny-scale single-turn SFT and RL training. Can be trained on a single consumer GPU in a few minutes, and is ideal for getting started.
2. [Wordle](examples/wordle/README.md): Train Qwen3-1.7B to play Wordle. A fun example of multi-turn SFT and RL training. Can be trained on 2-4 H100 GPUs in a few hours. Ideal for exploring the multi-turn training capabilities of the framework.
3. [Alphabet Sort](examples/alphabet_sort/README.md): Train Qwen3-4B-Instruct-2507 to sort names alphabetically. Demonstrates multi-turn RL training via LoRA without SFT warmup. Can be trained on a single H100 GPU in just over an hour. Ideal for exploring LoRA-based training.
4. [Wiki Search](examples/wiki_search/README.md): Train Qwen3-4B-Instruct-2507 to answer trivia questions by searching through Wikipedia. Demonstrates multi-turn training with web search tool use.
5. *More to come...*
Docs
Check out the [docs](docs) directory for in-depth guides on how to use PRIME-RL.
- [Entrypoints](docs/entrypoints.md) - Overview of the main components (orchestrator, trainer, inference) and how to run SFT, RL, and evals
- [Configs](docs/configs.md) - Configuration system using TOML files, CLI arguments, and environment variables
- [Environments](docs/environments.md) - Installing and using verifiers environments from the Environments Hub
- [Async Training](docs/async.md) - Understanding asynchronous off-policy training and step semantics
- [Logging](docs/logging.md) - Logging with loguru, torchrun, and Weights & Biases
- [Checkpointing](docs/checkpointing.md) - Saving and resuming training from checkpoints
- [Benchmarking](docs/benchmarking.md) - Performance benchmarking and throughput measurement
- [Deployment](docs/deployment.md) - Training deployment on single-GPU, multi-GPU, and multi-node clusters
- [On-Policy Distillation](docs/on_policy_distillation.md) - Self-distillation with EMA teacher and top-K tail KL divergence
- [Bring Your Own Algorithms](docs/bring-your-own-algorithms.md) - Custom loss functions, advantage functions, and reward shaping
-…