Tencent-Hunyuan/SAGE-GRPO
Python
Description: Official implementation of SAGE-GRPO: Manifold-Aware Exploration for Reinforcement Learning in Video Generation
Language: Python
License: NOASSERTION
Stars: 123
Forks: 2
Open issues: 2
Created: 2026-03-24T07:59:12Z
Pushed: 2026-04-02T06:08:46Z
Default branch: master
Fork: no
Archived: no
README:
Manifold-Aware Exploration for Reinforcement Learning in Video Generation
Figure 1. Illustration of SAGE-GRPO. (Left) (a.1) At higher noise regions, Euler-style discretization introduces extra energy (discretization error) beyond the true integral. (a.2) Our precise SDE removes unnecessary noise energy in high-noise regions, enabling more precise exploration and a better-learned data manifold. (Right) (b) Our method with improved exploration yields more stable and better-aligned generations compared with DanceGRPO, FlowGRPO, and CPS.
Highlights
We formulate GRPO for video generation as a manifold-constrained exploration problem:
Figure 2. Geometric interpretation of noise injection strategies. Conventional linear SDEs (red) inject exploration noise using first-order approximations, causing off-manifold drift and temporal jitter. Our Manifold-Aware SDE (blue) uses a logarithmic correction term so that exploration noise stays close to the flow trajectory and the video manifold.
- Core Problem: We show that the ODE-to-SDE conversions used in existing video GRPO methods can inject excess noise in high-noise steps, which reduces rollout quality and makes reward-guided updates less reliable.
- Micro-level: We constrain exploration with a *Precise Manifold-Aware SDE* and a *Gradient Norm Equalizer*, so that sampling noise stays manifold-consistent and updates are balanced across timesteps.
- Macro-level: We constrain long-horizon exploration with a *Dual Trust Region* using moving anchors and step-wise constraints, so that the trust region tracks more manifold-consistent checkpoints and prevents drift.
Abstract
Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment.
To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable.
We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a *precise manifold-aware SDE* with a logarithmic curvature correction and introduce a *gradient norm equalizer* to stabilize sampling and updates across timesteps. At the macro level, we use a *dual trust region* with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift.
We evaluate SAGE-GRPO on HunyuanVideo-1.5 using VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality.
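For readers new to GRPO, the sketch below shows the generic group-relative advantage that gives the method its name: each rollout's reward is normalized against the other rollouts sampled for the same prompt. This is a minimal illustration of standard GRPO, not code from this repository. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Hypothetical helper for illustration only; SAGE-GRPO's own
    implementation may differ (for example in the std estimator or eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one prompt, scored by some reward model.
    print(group_relative_advantages([0.2, 0.5, 0.4, 0.9]))
```

Because advantages are computed relative to the group, low-quality rollouts caused by excess exploration noise distort the mean and spread that every other rollout is compared against, which is one way to read the abstract's point about unreliable reward estimates.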
Table of Contents
- [Highlights](#highlights)
- [Abstract](#abstract)
- [Installation](#installation)
- [Checkpoint Preparation](#checkpoint-preparation)
- [Post-Training](#post-training)
- [Key Training Parameters](#key-training-parameters)
- [Recommended 64-GPU Default](#recommended-64-gpu-default)
- [Visualization Gallery](#visualization-gallery)
- [Acknowledgements](#acknowledgements)
- [License](#license)
- [Citation](#citation)
Installation
1. Clone the repository
git clone <repository-url>
cd SAGE-GRPO
2. Install Python dependencies
pip install -r requirements.txt
3. Download the reward model with the helper script
bash download_weights.sh
4. Download the remaining HunyuanVideo checkpoints
After running download_weights.sh, follow checkpoints-download.md to download the remaining base model, text encoder, and vision encoder weights.
Checkpoint Preparation
SAGE-GRPO expects both the HunyuanVideo-1.5 base checkpoints and the VideoReward reward model to be available under ./ckpts.
Useful references:
- Base model documentation: README_HYVideo.md
- Detailed checkpoint download instructions: checkpoints-download.md
- Reward checkpoint helper: download_weights.sh
Expected Checkpoint Layout
ckpts/
├── assets
├── config.json
├── LICENSE
├── NOTICE
├── README.md
├── README_CN.md
├── scheduler
├── text_encoder
│   ├── byt5-small
│   ├── Glyph-SDXL-v2
│   └── llm
├── transformer
├── upsampler
├── vae
├── VideoReward
│   ├── checkpoint-11352
│   ├── model_config.json
│   └── README.md
└── vision_encoder
    └── siglip
If your local structure differs substantially from the above, training usually fails during model or reward initialization.
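Since a mismatched layout tends to surface only at initialization time, a quick pre-flight check can save a failed launch. The sketch below is not part of the repository: the helper name `missing_checkpoint_entries` and the choice of which entries to treat as required are assumptions, taken from the expected layout for `ckpts/`.

```python
from pathlib import Path

# Entries taken from the expected layout; treating exactly these as
# required is an assumption of this sketch, not a rule from the repository.
EXPECTED_ENTRIES = [
    "config.json",
    "scheduler",
    "text_encoder/byt5-small",
    "text_encoder/Glyph-SDXL-v2",
    "text_encoder/llm",
    "transformer",
    "upsampler",
    "vae",
    "VideoReward/checkpoint-11352",
    "VideoReward/model_config.json",
    "vision_encoder/siglip",
]


def missing_checkpoint_entries(root="./ckpts"):
    """Return the expected entries that do not exist under `root`."""
    root_path = Path(root)
    return [entry for entry in EXPECTED_ENTRIES if not (root_path / entry).exists()]


if __name__ == "__main__":
    missing = missing_checkpoint_entries()
    if missing:
        print("Missing checkpoint entries:")
        for entry in missing:
            print(f"  ckpts/{entry}")
    else:
        print("Checkpoint layout looks complete.")
```

The check only confirms that paths exist; it does not validate file contents or checkpoint versions.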
Post-Training
Hardware Recommendation
| Requirement | Recommended |
| --- | --- |
| GPU memory | 80 GB per GPU |
| GPU count | 64 GPUs (8 nodes x 8) |
| OS | Linux |
| PyTorch | 2.6+ |
Single-node multi-GPU
For a single machine with 8 GPUs:
bash run_post_train.sh
This launches post_train.py with the default GRPO configuration via torchrun --nproc_per_node=8.
Multi-node multi-GPU
For multi-node training:
bash run_post_train_multinode.sh
The multi-node entry internally calls:
bash scripts/post_train/pdsh_train.sh "scripts/post_train/train_grpo.sh"
Before starting, edit or export the node list and the rendezvous-related environment variables expected by your cluster launcher.
Key Training Parameters
Distributed Training
The three most important distributed-training knobs are sp_size, batch_size, and num_generations.
dp_degree = world_size / sp_size
There is a validity constraint:
(batch_size * dp_degree) % num_generations == 0
| Parameter | Default | Description |
| --- | --- | --- |
| sp_size | 8 | Sequence parallel degree. Must evenly divide world_size. |
| batch_size | 2 | Per-rank video micro-batch size. |
| num_generations | 4 | Number of… |
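The two rules for these parameters (sp_size must evenly divide world_size, and batch_size * dp_degree must be a multiple of num_generations) can be checked before launching a job. The following is a minimal sketch of that arithmetic, not code from the repository; the function name `check_distributed_config` and the error messages are hypothetical.

```python
def check_distributed_config(world_size, sp_size, batch_size, num_generations):
    """Validate the distributed-training constraints stated in this README.

    Hypothetical helper: returns (dp_degree, batch_size * dp_degree) when the
    configuration is valid, and raises ValueError otherwise.
    """
    if world_size % sp_size != 0:
        raise ValueError(
            f"sp_size={sp_size} must evenly divide world_size={world_size}"
        )
    dp_degree = world_size // sp_size

    product = batch_size * dp_degree
    if product % num_generations != 0:
        raise ValueError(
            f"(batch_size * dp_degree) = {product} is not divisible by "
            f"num_generations={num_generations}"
        )
    return dp_degree, product


if __name__ == "__main__":
    # 64 GPUs with the table defaults: sp_size=8, batch_size=2, num_generations=4.
    print(check_distributed_config(64, 8, 2, 4))
```

With 64 GPUs and the defaults listed in the table, dp_degree is 64 / 8 = 8 and batch_size * dp_degree is 16, which is divisible by num_generations = 4, so the configuration passes.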