Repo · MiniMax · published May 23, 2025

MiniMax-AI/One-RL-to-See-Them-All

Python

Description: The official repo of One RL to See Them All: Visual Triple Unified Reinforcement Learning

Language: Python

License: MIT

Stars: 333

Forks: 17

Open issues: 6

Created: 2025-05-23T08:03:39Z

Pushed: 2025-05-31T12:43:31Z

Default branch: main

Fork: no

Archived: no

README:

One RL to See Them All

We propose V-Triune (Visual Triple Unified Reinforcement Learning), a unified Reinforcement Learning (RL) system designed to advance Vision-Language Models (VLMs). It enables VLMs to jointly learn and master both visual reasoning and perception tasks within a single training pipeline. Our model, Orsta, trained with this approach, demonstrates how one RL framework can empower VLMs to "See Them All", delivering significant performance boosts across a diverse range of visual tasks.

V-Triune consists of three complementary components: Sample-Level Data Formatting (unifies diverse task inputs), Verifier-Level Reward Computation (delivers custom rewards via specialized verifiers), and Source-Level Metric Monitoring (diagnoses problems at the data-source level).
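To make the three tiers concrete, here is a toy Python sketch of how they could fit together. It is not V-Triune's implementation. Each sample carries its own verifier name and data source (sample level). A registry dispatches the sample to that verifier (verifier level). Rewards are then averaged per data source (source level). The field names (`data_source`, `verifier`, `answer`), the registry keys, and both toy verifiers are hypothetical stand-ins.

```python
from collections import defaultdict
from statistics import mean


# Hypothetical verifiers: each maps (model_output, reference_answer) to a reward in [0, 1].
def exact_match_verifier(output: str, answer: str) -> float:
    """Reward 1.0 when the case-insensitive, whitespace-trimmed output equals the answer."""
    return float(output.strip().lower() == answer.strip().lower())


def choice_verifier(output: str, answer: str) -> float:
    """Reward 1.0 when the first character of the output matches the reference option letter."""
    return float(output.strip()[:1].upper() == answer.strip()[:1].upper())


# Verifier level: a registry keyed by the verifier name stored in each sample (hypothetical keys).
VERIFIERS = {"exact_match": exact_match_verifier, "choice": choice_verifier}


def compute_rewards(samples, outputs):
    """Score each output with its sample's verifier and average rewards per data source."""
    per_source = defaultdict(list)
    rewards = []
    for sample, output in zip(samples, outputs):
        reward = VERIFIERS[sample["verifier"]](output, sample["answer"])
        rewards.append(reward)
        per_source[sample["data_source"]].append(reward)
    # Source level: one mean reward per data source, useful for spotting a failing source.
    metrics = {source: mean(values) for source, values in per_source.items()}
    return rewards, metrics


if __name__ == "__main__":
    # Sample level: every sample is self-describing (hypothetical sources and answers).
    samples = [
        {"data_source": "source_a", "verifier": "exact_match", "answer": "42"},
        {"data_source": "source_a", "verifier": "exact_match", "answer": "7"},
        {"data_source": "source_b", "verifier": "choice", "answer": "B"},
    ]
    rewards, metrics = compute_rewards(samples, ["42", "8", "b) photosynthesis"])
    print(rewards)  # [1.0, 0.0, 1.0]
    print(metrics)  # {'source_a': 0.5, 'source_b': 1.0}
```

Because the verifier name travels with the sample, new task types can be added by registering another verifier rather than changing the training loop.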

Key Features

What makes V-Triune and Orsta stand out:

  • Unified RL Framework 🤖: V-Triune is the *first* system to enable VLMs to jointly master visual reasoning (e.g., Math, Puzzles) and perception (e.g., Detection, Grounding) within a *single*, streamlined RL training pipeline.
  • High-Performance Orsta Models 🚀: Trained using our V-Triune system on 8 diverse tasks (4 reasoning + 4 perception), Orsta models (ranging from 7B to 32B) achieve *substantial* performance gains—up to +14.1% on the comprehensive MEGA-Bench Core—demonstrating the effectiveness and scalability of our unified approach.
  • Novel Dynamic IoU Reward 🎯: We introduce an *innovative* Dynamic IoU reward mechanism that provides adaptive, progressive feedback. This significantly improves stability and performance, particularly on challenging visual perception tasks.
  • Open & Accessible 🌐: Both the V-Triune system and the high-performance Orsta models are publicly available, encouraging further research and development in VLM training.
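The README does not spell out the Dynamic IoU formula in this excerpt. One way to picture "adaptive, progressive feedback" is a binary IoU reward whose threshold tightens as training progresses. The sketch below assumes that form. The stage boundaries and thresholds in `SCHEDULE` are illustrative placeholders, not the values used by V-Triune or by its `DET_IOU_THRESHOLD` setting.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, box_a[2] - box_a[0]) * max(0.0, box_a[3] - box_a[1])
    area_b = max(0.0, box_b[2] - box_b[0]) * max(0.0, box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (end of stage as a fraction of training, IoU threshold for that stage).
SCHEDULE = ((0.10, 0.50), (0.25, 0.75), (1.00, 0.95))


def dynamic_threshold(step, total_steps, schedule=SCHEDULE):
    """Return the IoU threshold for the current training stage."""
    progress = step / total_steps
    for stage_end, threshold in schedule:
        if progress < stage_end:
            return threshold
    return schedule[-1][1]  # at or past the end of training, keep the strictest threshold


def dynamic_iou_reward(pred_box, gt_box, step, total_steps):
    """Binary reward: 1.0 if the prediction clears the current stage's IoU threshold."""
    return 1.0 if iou(pred_box, gt_box) >= dynamic_threshold(step, total_steps) else 0.0


if __name__ == "__main__":
    pred, gt = (0, 0, 10, 10), (0, 0, 10, 6)  # IoU = 60 / 100 = 0.6
    print(dynamic_iou_reward(pred, gt, step=0, total_steps=1000))    # 1.0 (threshold 0.5)
    print(dynamic_iou_reward(pred, gt, step=500, total_steps=1000))  # 0.0 (threshold 0.95)
```

The same box earns a reward early in training and none later. A lenient early threshold gives the model a learning signal, while a strict late threshold pushes it toward precise localization.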

News

  • [2025/05/23] 🎉 We are excited to release our technical report! You can read the paper here.

Main Results

Below we present the main results for our Orsta models, focusing on training dynamics and performance specifically on the MEGA-Bench Core benchmark.

🚀 Get Started

Introduction

This guide will walk you through setting up the environment, installing necessary dependencies, configuring a Ray cluster, and setting up experiment parameters to run *One-RL-to-See-Them-All*.

---

🛠️ Installation

Follow these steps to prepare your environment and install the required packages.

We provide a Dockerfile in the docker/ directory for a containerized setup. Alternatively, you can configure your environment with Conda as described below.

1. Create and Activate Conda Environment: We recommend using Python 3.12.

conda create -n v_triune python=3.12
conda activate v_triune

2. Install PyTorch (CUDA 12.4): This project targets PyTorch 2.6.0 built for CUDA 12.4.

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

3. Install FlashAttention: FlashAttention 2.7.3 provides efficient attention kernels. It may be compiled from source during installation, so make sure ninja is installed correctly first.

pip uninstall -y ninja && pip install ninja
pip install flash-attn==2.7.3 --no-build-isolation

4. Install v_triune: Clone the *One-RL-to-See-Them-All* repository and install it in editable mode.

git clone https://github.com/MiniMax-AI/One-RL-to-See-Them-All.git
cd One-RL-to-See-Them-All
pip install -e .
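After installation, a short Python check can confirm that the interpreter and the pinned packages match the steps above. This is a minimal sketch built on the standard library's `importlib.metadata`. The helper names are hypothetical, and it assumes the `+cu124` local suffix that PyTorch's CUDA wheels add to their version strings should be ignored.

```python
import sys
from importlib import metadata

# Version pins copied from the installation steps above.
PINS = {"torch": "2.6.0", "torchvision": "0.21.0", "torchaudio": "2.6.0", "flash-attn": "2.7.3"}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(installed, pins=PINS):
    """Return {package: (found, expected)} for packages that are missing or differ from the pin."""
    problems = {}
    for name, expected in pins.items():
        found = installed.get(name)
        # CUDA wheels report versions like "2.6.0+cu124"; compare only the public version part.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (found, expected)
    return problems


if __name__ == "__main__":
    if sys.version_info < (3, 12):
        print(f"Warning: Python {sys.version.split()[0]} found; 3.12 is recommended.")
    mismatches = find_mismatches(installed_versions(PINS))
    for name, (found, expected) in mismatches.items():
        print(f"{name}: found {found}, expected {expected}")
    if not mismatches:
        print("All pinned packages match.")
```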

---

📦 Data Preparation

This section outlines how to download and structure the Orsta-Data-47k dataset.

1. Download Dataset:

Download the dataset using the Hugging Face CLI:

huggingface-cli download \
--repo-type dataset \
--resume-download \
One-RL-to-See-Them-All/Orsta-Data-47k \
--local-dir Orsta-Data-47k
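If you prefer to download from Python, `huggingface_hub.snapshot_download` accepts the same repository ID, repository type, and local directory as the CLI command above. The wrapper name `download_orsta_data` is hypothetical. Recent `huggingface_hub` releases resume interrupted downloads automatically, so no resume flag is passed.

```python
# Arguments mirroring the huggingface-cli command above.
DOWNLOAD_KWARGS = {
    "repo_id": "One-RL-to-See-Them-All/Orsta-Data-47k",
    "repo_type": "dataset",
    "local_dir": "Orsta-Data-47k",
}


def download_orsta_data(**overrides):
    """Download the dataset snapshot and return the local directory path."""
    # Imported lazily so this module can be loaded even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(**{**DOWNLOAD_KWARGS, **overrides})


if __name__ == "__main__":
    print(download_orsta_data())
```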

2. Dataset Structure & Format: All data files are in Parquet (.parquet) format. The Orsta-Data-47k directory will be structured as:

Orsta-Data-47k/
├── test/
│   ├── test_chart_megabench_176.parquet
│   └── ... (other test files)
└── train/
    ├── train_chart_chartqapro_498.parquet
    └── ... (other train files)

3. File Naming:

Files follow the convention: {split}_{task_name}_{source_name}_{num}.parquet

{split}: train or test.

{task_name}: General task category (e.g., chart, science).

{source_name}: Specific data benchmark/origin.

{num}: Number of samples in the file.
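The naming convention can be parsed with a single regular expression. This is a sketch under one stated assumption: `{split}` and `{task_name}` contain no underscores, while `{source_name}` may. The `OrstaFile` class and `parse_orsta_filename` function are hypothetical helpers, not part of the repository.

```python
import re
from dataclasses import dataclass

# {split}_{task_name}_{source_name}_{num}.parquet
# Assumption: split and task_name contain no underscores; source_name may contain them.
_PATTERN = re.compile(r"^(train|test)_([^_]+)_(.+)_(\d+)\.parquet$")


@dataclass(frozen=True)
class OrstaFile:
    split: str
    task_name: str
    source_name: str
    num: int  # number of samples in the file


def parse_orsta_filename(filename: str) -> OrstaFile:
    """Split an Orsta-Data-47k file name into its four fields."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    split, task_name, source_name, num = match.groups()
    return OrstaFile(split, task_name, source_name, int(num))


if __name__ == "__main__":
    print(parse_orsta_filename("test_chart_megabench_176.parquet"))
    # OrstaFile(split='test', task_name='chart', source_name='megabench', num=176)
```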

4. Split Usage:

train/: Corpus for training the model.

test/: Samples for online evaluation during training, used to diagnose learning progress per task.
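Because `{num}` is encoded in each file name, you can take a quick per-task inventory of either split without opening the Parquet files. This lets you check, for example, which tasks the test split covers for online evaluation. The sketch uses only the standard library, and `summarize_names` and `summarize_split` are hypothetical helpers. It relies on the same naming assumption as before: `{task_name}` contains no underscores.

```python
import re
from collections import Counter
from pathlib import Path

# Captures split, task_name, and num; source_name is matched but not captured.
_NAME = re.compile(r"^(train|test)_([^_]+)_.+_(\d+)\.parquet$")


def summarize_names(names, split):
    """Sum the {num} field per task_name over file names that belong to `split`."""
    totals = Counter()
    for name in names:
        match = _NAME.match(name)
        if match and match.group(1) == split:
            totals[match.group(2)] += int(match.group(3))
    return dict(totals)


def summarize_split(root, split):
    """Inventory root/split (e.g. Orsta-Data-47k/train) using only the file names."""
    names = sorted(path.name for path in Path(root, split).glob("*.parquet"))
    return summarize_names(names, split)


if __name__ == "__main__":
    for split in ("train", "test"):
        print(split, summarize_split("Orsta-Data-47k", split))
```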

---

🌐 Ray Cluster Setup

For distributed training, set up a Ray cluster. Here is an example for a 2-node cluster with 8 GPUs per node.

1. Start the Head Node: Run this command on your designated head node. The dashboard will be accessible at http://<head_node_ip>:8265.

ray start --head --dashboard-host=0.0.0.0

Note down the address provided (e.g., xxxxxx:6379).

2. Start Worker Node(s): Run this command on each worker node, replacing xxxxxx:6379 with the address from the head node.

ray start --address=xxxxxx:6379

3. Verify Cluster Status: On the head node, run ray status to confirm that all nodes have joined and all GPUs (16 in this example) are detected.
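The same check can be scripted from Python on the head node with Ray's public API (`ray.init(address="auto")`, `ray.nodes()`, `ray.cluster_resources()`). This is a minimal sketch, and the helper names are hypothetical. The default of 16 GPUs matches the 2 × 8 example above and should be adjusted for your own cluster.

```python
def check_gpu_count(resources: dict, expected_gpus: int) -> bool:
    """True if the cluster reports at least `expected_gpus` GPUs."""
    return int(resources.get("GPU", 0)) >= expected_gpus


def verify_cluster(expected_gpus: int = 16) -> None:
    """Connect to the running cluster and report alive nodes and total GPUs."""
    import ray  # imported lazily so the helper above is usable without Ray installed

    ray.init(address="auto")  # attach to the cluster started with `ray start`
    try:
        alive = [node for node in ray.nodes() if node.get("Alive")]
        resources = ray.cluster_resources()
        print(f"Alive nodes: {len(alive)}, GPUs: {resources.get('GPU', 0)}")
        if not check_gpu_count(resources, expected_gpus):
            raise RuntimeError(f"Expected {expected_gpus} GPUs, found {resources.get('GPU', 0)}")
    finally:
        ray.shutdown()


if __name__ == "__main__":
    verify_cluster()
```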

---

🏆 Reward Server

This section describes how to launch a remote reward server, which is used to calculate reward values during the training process.

To start the reward server, execute the following command:

bash scripts/reward_server.sh

The script exposes several configurable parameters:

DET_IOU_THRESHOLD: Defines the threshold strategy for the Intersection over Union (IoU) reward.

PORT: Specifies the network port on which the reward server will listen for incoming requests.

WORKERS: Sets the number of worker processes for the server.
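Before starting training, it can help to confirm that the reward server is accepting connections on the configured PORT. The sketch below is a generic TCP reachability check built on the standard library's `socket.create_connection`. It does not call any reward-server endpoint, since the API is not described here. The helper name is hypothetical, and the host and port must be taken from your own server launch.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


if __name__ == "__main__":
    import sys

    # Usage: python check_port.py <host> <port>  (values from your reward server launch)
    host, port = sys.argv[1], int(sys.argv[2])
    print("reachable" if is_port_open(host, port) else "not reachable")
```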

Upon successful execution, the script launches a FastAPI service. A file named with a unique JOB_ID will be created within the .reward_server/ directory in your project. This JOB_ID file contains the IP address and PORT of the running…
