Tencent-Hunyuan/DRIVE-RLVR
Stars: 9
Forks: 1
Open issues: 0
Created: 2025-11-10T11:00:41Z
Pushed: 2025-11-12T09:46:18Z
Default branch: main
Fork: no
Archived: no
README:
📖 Paper • 📙 SFT Model • 📘 RL Model • 📜 Citation
-----
Abstract
Recent reasoning-first models have spurred a resurgence of interest in RLVR (Reinforcement Learning with Verifiable Reward). However, advances are dominated by mathematics, with competitive-programming code generation being relatively underexplored. This work investigates how to construct RLVR datasets and presents practical training techniques that yield strong performance.
Our pipeline begins with Supervised Fine-Tuning (SFT) distilled from strong open-source models. This is followed by a two-stage RL process using executable, testcase-driven rewards:
1. Stage 1 (Entropy Expansion): Training on a large, uniformly distributed set of problems with moderate rollouts (8) and a shorter context (24k) to expand entropy and mitigate repetition.
2. Stage 2 (Hard-Focus Curriculum): Updating on a small, high-quality set of *challenging* problems using Pre-GRPO with a large rollout budget (64) under a hard-focus curriculum.
We implement our method on Qwen2.5-32B and achieve state-of-the-art performance among models of similar scale, comparable to leading systems like DeepSeek v3.1.
🚀 The DRIVE Pipeline
Our training pipeline consists of two main phases: Supervised Fine-Tuning (SFT) and a Two-Stage Reinforcement Learning process, as illustrated below.

> *Figure 2: The training pipeline of our models.*
Phase 1: Supervised Fine-Tuning (SFT)
We begin by fine-tuning Qwen2.5-32B. The key innovation in this stage is Difficulty-Aware Sampling:
- We first classify all competitive programming prompts into three categories: easy, medium, and hard.
- To force the model to focus on more challenging problems, we duplicate hard samples twice in the final SFT dataset.
- We also augment this with general-purpose coding and reasoning-intensive data to improve overall capabilities.
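The duplication step can be sketched as follows. This is a hypothetical illustration, not the authors' code: the `difficulty` field and the `build_sft_dataset` helper are assumed names, and "duplicate twice" is read here as each hard sample appearing twice in the final set. If it instead means two extra copies, set `hard_copies=3`.

```python
from collections import Counter


def build_sft_dataset(samples, hard_copies=2):
    """Repeat hard samples `hard_copies` times; keep easy and medium samples once."""
    dataset = []
    for sample in samples:
        copies = hard_copies if sample["difficulty"] == "hard" else 1
        dataset.extend([sample] * copies)
    return dataset


# Toy prompts with an assumed difficulty label.
samples = [
    {"id": "p1", "difficulty": "easy"},
    {"id": "p2", "difficulty": "medium"},
    {"id": "p3", "difficulty": "hard"},
]
dataset = build_sft_dataset(samples)
counts = Counter(s["id"] for s in dataset)  # how often each prompt appears
```

In practice the dataset would be shuffled after duplication so repeated hard samples do not appear back to back.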
Phase 2: Two-Stage Reinforcement Learning
After SFT, the model still suffers from low entropy, repetitive generation, and poor performance on hard problems. Our two-stage RL process directly addresses these issues.
Stage 1: Entropy Expansion
- Goal: Increase output diversity and reduce repetitive patterns.
- Data: A large, uniformly distributed set of \~9k problems.
- Method: We use 8 rollouts and a shorter 24k token length. As shown in Figure 3, this "24k-style" training (blue line) successfully increases entropy, while standard "32k-style" training (orange line) leads to entropy collapse.

> *Figure 3: The entropy comparison of 24k-style training and 32k-style training.*
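The README does not say how entropy is measured. A common choice, assumed here, is the mean Shannon entropy of the policy's next-token distributions over a rollout. The helper names below are illustrative.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average entropy over all token positions in a rollout."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


# A near-deterministic policy versus a maximally uncertain one over 4 tokens.
peaked = [[0.97, 0.01, 0.01, 0.01]]
uniform = [[0.25, 0.25, 0.25, 0.25]]
```

Under this measure, "entropy collapse" corresponds to distributions drifting toward the `peaked` case, where the model repeats a few high-probability continuations; the `uniform` case reaches the maximum of `math.log(4)` nats.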
Stage 2: Hard-Focus Curriculum
- Goal: Master the most challenging problems.
- Data: A small, high-quality set of difficult problems (e.g., the 72, 50, and 32 hardest cases from LiveCode V6).
- Method: We apply a "hard-focus curriculum" that progressively retains only the most difficult instances. Crucially, we use a large rollout budget (64-80 rollouts) in this stage, which we found essential for stable gains on hard problems.
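One way to picture the progressive retention is the sketch below. It assumes difficulty is ranked by an empirical pass rate per problem (for example, the fraction of rollouts that pass all tests); that ranking criterion and the helper names are assumptions, while the subset sizes 72, 50, and 32 come from the Data bullet above.

```python
def hardest_k(pass_rates: dict[str, float], k: int) -> list[str]:
    """Return the ids of the k problems with the lowest pass rate (ties broken by id)."""
    ranked = sorted(pass_rates, key=lambda pid: (pass_rates[pid], pid))
    return ranked[:k]


def curriculum(pass_rates, sizes=(72, 50, 32)):
    """Build successively smaller subsets, each drawn from the previous one."""
    pool = dict(pass_rates)
    stages = []
    for k in sizes:
        kept = hardest_k(pool, k)
        stages.append(kept)
        pool = {pid: pool[pid] for pid in kept}  # next phase only sees retained problems
    return stages


# Toy pass rates for 100 problems: p000 is the hardest, p099 the easiest.
pass_rates = {f"p{i:03d}": i / 100 for i in range(100)}
stages = curriculum(pass_rates)
```

A real curriculum would presumably re-estimate pass rates with the updated policy between phases; the sketch reuses one fixed set of rates for brevity.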
📊 Key Results
Our final 32B model, DRIVE-RL, achieves state-of-the-art performance among similarly sized models and is competitive with larger 64k-context models.

> *Figure 1: Performance of our models on various benchmarks.*
Pass@1 Performance Comparison
The two-stage RL pipeline improves Pass@1 over the SFT baseline on every benchmark, with the largest gain on the hardest one: Codeforces OJ rises from 0.115 to 0.182, a +58.3% relative improvement.
| Model | LiveCode 08-11 | LiveCode V5 | LiveCode V6 | LeetCode Weekly (32) | Codeforces OJ (33) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| DeepseekV3.1 (64k) | 0.692 | 0.713 | 0.693 | 0.688 | 0.161 |
| Seed1.6-0715 (64k) | 0.803 | 0.824 | 0.770 | 0.743 | 0.188 |
| Qwen3-235B-2507 (64k) | 0.681 | 0.713 | 0.646 | 0.688 | 0.200 |
| --- | --- | --- | --- | --- | --- |
| SFT model (32k) | 0.602 | 0.594 | 0.549 | 0.578 | 0.115 |
| RL Stage 1 model (24k) | 0.625 | 0.627 | 0.634 | 0.603 | 0.112 |
| DRIVE-RL model (32k) | 0.699 | 0.697 | 0.703 | 0.653 | 0.182 |
| *Rel. Improvement (RL vs SFT)* | *+16.1%* | *+17.3%* | *+28.1%* | *+13.0%* | *+58.3%* |
*(Data sourced from Table 2 in our paper)*
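The relative-improvement row can be reproduced from the SFT and DRIVE-RL rows with the usual formula, (new - base) / base. The snippet below is a small check using only the numbers in the table; the variable names are illustrative.

```python
benchmarks = [
    "LiveCode 08-11",
    "LiveCode V5",
    "LiveCode V6",
    "LeetCode Weekly (32)",
    "Codeforces OJ (33)",
]
sft = [0.602, 0.594, 0.549, 0.578, 0.115]       # SFT model (32k) row
drive_rl = [0.699, 0.697, 0.703, 0.653, 0.182]  # DRIVE-RL model (32k) row


def relative_improvement(new, base):
    """Relative gain of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


improvements = [round(relative_improvement(n, b), 1) for n, b in zip(drive_rl, sft)]

for name, gain in zip(benchmarks, improvements):
    print(f"{name}: +{gain:.1f}%")
```

Note that these are relative gains: the absolute Pass@1 gain on Codeforces OJ is 0.067, which is large relative to the low 0.115 baseline.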
Key Findings
1. Difficulty-aware training is crucial: Standard RL struggles with hard problems. Our hard-focus curriculum (Stage 2) is essential for pushing the model's capabilities.
2. Entropy expansion is necessary: Skipping Stage 1 (Entropy Expansion) and training *only* on hard cases hurts generalization to out-of-distribution benchmarks. Both stages are necessary.
3. Large rollouts for hard problems: A large rollout budget (e.g., 64+) is essential for mastering challenging cases.
4. Scaling: The DRIVE strategy shows strong, positive scaling trends when applied to a large-scale internal MoE model.
📜 Citation
If you find this work useful, please cite our paper:
@misc{zhu2025drivedatacurationbest,
title={DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation},
author={Speed Zhu and Jianwei Cai and Guang Chen and Lulu Wu and Saiyong Yang and Wiggin Zhou},
year={2025},
eprint={2511.06307},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.06307},
}
License
This repository contains two separate licenses for different models:
- DRIVE-RL Model: Licensed under [LICENSE - DRIVE-RL.txt](LICENSE%20-%20DRIVE-RL.txt)
- DRIVE-SFT Model: Licensed under [LICENSE - DRIVE-SFT.txt](LICENSE%20-%20DRIVE-SFT.txt)
Please refer to the respective license file for the model you are using.