OpenBMB/DeepThinkVLA
Description: DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
Language: Python
License: MIT
Stars: 525
Forks: 48
Open issues: 3
Created: 2025-10-13T04:37:37Z
Pushed: 2026-04-16T10:43:05Z
Default branch: main
Fork: no
Archived: no
README:
🔥 DeepThinkVLA 🔥
Enhancing Reasoning Capability of Vision-Language-Action Models
🔗 Quick Links
- [Overview](#overview)
- [Highlights](#highlights)
- [Architecture](#architecture)
- [Embodied CoT Dataset](#embodied-cot-dataset)
- [Training Pipeline](#training-pipeline)
- [Performance](#performance)
- [LIBERO Plus Zero-shot Evaluation](#-libero-zero-shot-evaluation)
- [Qualitative Behavior](#qualitative-behavior)
- [Setup](#setup)
- [Data & Checkpoints](#data--checkpoints)
- [Experiments](#experiments)
- [Repository Structure](#repository-structure)
- [Star History](#star-history)
- [Acknowledgements](#acknowledgements)
- [References](#references)
📰 News
- 2026-01-20: Added LIBERO Plus zero-shot evaluation instructions + results (see the standalone eval repo: `wadeKeith/DeepThinkVLA_libero_plus`).
📝 TODO
- [x] LIBERO benchmark
- [x] LIBERO Plus zero-shot evaluation
- [ ] RoboTwin benchmark
- [ ] Real-world hardware experiments
🧠 Overview
DeepThinkVLA rethinks Vision-Language-Action (VLA) policies with explicit deliberation. Starting from the public pi0-FAST checkpoint, we refactor the policy into a 2.9B-parameter hybrid decoder that writes a reasoning trace before emitting action chunks. The accompanying paper combines embodied Chain-of-Thought (CoT) supervised fine-tuning (SFT) with outcome-driven reinforcement learning (RL), yielding a 97.0% average success rate (SR) across the LIBERO benchmark (Object 99.0, Spatial 96.6, Goal 96.4, Long 96.2). The hybrid architecture alone lifts success by 15.5 percentage points over a naive autoregressive CoT variant, and the RL refinement supplies a final +2.0-point boost on LIBERO-Long.
✨ Highlights
- Hybrid attention decoder cleanly separates autoregressive reasoning from parallel action generation, closing the latency gap while keeping control precise.
- Two-stage CoT data engine distills key frames with a cloud LVLM and scales to full trajectories via a fine-tuned local VLM.
- Outcome-based RL with grouped credit assignment aligns the full think-act sequence and stabilizes updates with KL regularization to the SFT policy.
- Masked-CoT inference (DeepThinkVLA) preserves accuracy (96.5% average SR) while running at 0.175x the latency of autoregressive pi0-FAST, whereas random CoT degrades performance (85.1%).
🏗️ Architecture

DeepThinkVLA inserts a reasoning segment between observations and actions. Reasoning tokens are generated autoregressively, after which the decoder switches to bidirectional attention to emit action vectors in parallel. This resolves the modality conflict that limits single-decoder baselines and enables efficient rollouts for downstream reinforcement learning.
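The causal-then-bidirectional switch can be pictured as a single attention mask. The sketch below is illustrative only: the function name, the `[observation prefix | reasoning | action]` layout, and the choice to treat the observation prefix causally are assumptions, not details taken from the repository.

```python
def hybrid_attention_mask(n_prefix: int, n_reason: int, n_action: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed layout (hypothetical): [observation prefix | reasoning tokens | action tokens].
    """
    n_causal = n_prefix + n_reason  # positions decoded left to right
    total = n_causal + n_action
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < n_causal:
                # Observation and reasoning tokens: causal, so only the past and self.
                mask[i][j] = j <= i
            else:
                # Action tokens: full context plus every other action token.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Render a small mask: 'x' marks an allowed attention edge.
    for row in hybrid_attention_mask(n_prefix=2, n_reason=2, n_action=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Because every action position sees the whole action block, all action tokens can in principle be produced in one forward pass, while the reasoning tokens still need one decoding step each.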
📦 Embodied CoT Dataset

A scalable annotation pipeline supplies paired reasoning/action traces:
- Stage 1 isolates key frames via gripper-state heuristics, queries a cloud LVLM for high-quality CoT, and performs targeted human review.
- Stage 2 fine-tunes a local VLM on those exemplars and auto-labels the remaining frames, applying schema and temporal checks to keep trajectories coherent.
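The README does not spell out the gripper-state heuristic used in Stage 1. One plausible reading, sketched here under the assumption of a binary open/closed gripper signal per frame, is to flag the frames where that signal flips. The function name and the decision to always keep the first and last frame are hypothetical.

```python
def gripper_key_frames(gripper_open: list[bool]) -> list[int]:
    """Return frame indices where the gripper state flips.

    Assumption: one boolean per frame (True = open). The first and last
    frames are always kept so the trajectory has anchored endpoints.
    """
    if not gripper_open:
        return []
    keys = {0, len(gripper_open) - 1}
    for t in range(1, len(gripper_open)):
        if gripper_open[t] != gripper_open[t - 1]:
            keys.add(t)  # a grasp or release happened at frame t
    return sorted(keys)


# Example trajectory: open, open, closed, closed, open
key_frames = gripper_key_frames([True, True, False, False, True])
```

Only the selected frames would then be sent to the cloud LVLM for CoT annotation, which keeps the number of expensive queries small relative to the trajectory length.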
🔄 Training Pipeline

Training proceeds in two stages:
- SFT cold start: token-level cross-entropy teaches the hybrid decoder to produce well-formed CoT and aligned actions under causal/bidirectional masks.
- Outcome-driven RL: Group Relative Policy Optimization (GRPO) standardizes sparse rewards inside task-conditioned batches, while a KL penalty to the SFT policy prevents drift. The RL stage adds 2.0 points of SR on LIBERO-Long and strengthens the causal link between thought and action.
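The reward standardization and KL penalty in the RL stage can be sketched in a few lines. This is a simplified illustration, not the repository's training code: it omits the clipped importance ratio that GRPO implementations typically use, and the function names, `eps`, and `beta` values are assumptions.

```python
from statistics import mean, pstdev


def group_standardized_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize sparse outcome rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_objective(advantage: float, logp_policy: float, logp_sft: float,
                       beta: float = 0.05) -> float:
    """Surrogate objective for one sampled think-act sequence (to be maximized).

    logp_policy and logp_sft are the sequence log-probabilities under the
    current policy and the frozen SFT policy. Their difference is a
    single-sample estimate of KL(policy || SFT) for a policy-sampled sequence.
    """
    kl_estimate = logp_policy - logp_sft
    return advantage * logp_policy - beta * kl_estimate


# Four rollouts of the same task: two successes, two failures.
advantages = group_standardized_advantages([1.0, 0.0, 0.0, 1.0])
```

With binary success rewards, a group in which every rollout succeeds or every rollout fails yields zero advantages, so only groups with mixed outcomes contribute a learning signal.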
📊 Performance

- DeepThinkVLA reaches a 97.0% average success rate across LIBERO, outperforming autoregressive, diffusion, and parallel-decoding baselines under the single-model protocol.
- RL on top of SFT lifts LIBERO-Long from 94.2% to 96.2% without extra demonstrations, indicating better recovery on long-horizon tasks.
- The hybrid decoder outperforms the naive autoregressive CoT variant by 15.5 points while keeping latency manageable; Masked-CoT inference preserves accuracy while running at 0.175x pi0-FAST latency.
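The headline numbers can be cross-checked with simple arithmetic. The snippet below uses the per-suite success rates quoted in the Overview and assumes the reported average is an unweighted mean over the four suites, which the README does not state explicitly.

```python
# Per-suite LIBERO success rates (%) as quoted in the Overview.
suite_sr = {"Object": 99.0, "Spatial": 96.6, "Goal": 96.4, "Long": 96.2}

# Unweighted mean over suites (assumption); about 97.05, reported as 97.0.
average_sr = sum(suite_sr.values()) / len(suite_sr)

# RL gain on LIBERO-Long: SFT-only 94.2 -> SFT+RL 96.2.
rl_gain_long = suite_sr["Long"] - 94.2

# Running at 0.175x the baseline latency is roughly a 5.7x speedup.
speedup = 1 / 0.175
```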
🧪 LIBERO Plus Zero-shot Evaluation
We additionally report zero-shot transfer performance on LIBERO Plus:
- Training: the model is trained only on the standard LIBERO dataset (no LIBERO Plus fine-tuning).
- Evaluation: the trained model is directly evaluated on LIBERO Plus (zero-shot).
- Eval scripts: we maintain a lightweight, standalone evaluation repo here:
  - `wadeKeith/DeepThinkVLA_libero_plus`
Run (in the LIBERO Plus eval repo)
python experiments/run_libero_plus_eval.py \
  --pretrained_checkpoint /path/to/deepthinkvla_libero_checkpoint \
  --num_images_in_input 2 \
  --task_suite_name libero_10 \
  --max_new_tokens 2048 \
  --swanlab_mode disabled
Or use the wrapper:
bash eval.sh
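When scripting evaluations from Python, the same invocation can be assembled as an argument list. This is a hedged sketch: the flags and values are copied from the command above, while the helper name `build_eval_command` is hypothetical, and any suite name other than `libero_10` would need to be checked against the eval repo.

```python
import shlex


def build_eval_command(checkpoint: str, task_suite: str = "libero_10") -> list[str]:
    """Assemble the eval invocation shown above as an argv list (hypothetical helper)."""
    return [
        "python", "experiments/run_libero_plus_eval.py",
        "--pretrained_checkpoint", checkpoint,
        "--num_images_in_input", "2",
        "--task_suite_name", task_suite,
        "--max_new_tokens", "2048",
        "--swanlab_mode", "disabled",
    ]


cmd = build_eval_command("/path/to/deepthinkvla_libero_checkpoint")
print(shlex.join(cmd))  # inspect the command before running it
```

To execute it, pass `cmd` to `subprocess.run(cmd, check=True)` from the root of the LIBERO Plus eval repo.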
Outputs
- Logs: `experiments/logs/`
- Rollout videos (if enabled): `rollouts/`
Zero-shot Results (LIBERO Plus)
The following numbers are zero-shot success rates (SR) on LIBERO Plus, evaluated with a DeepThinkVLA model trained only on LIBERO (no LIBERO Plus fine-tuning).
Breakdown by shift type
| Objects Layout | Language Instructions | Light Conditions | Camera Viewpoints | Robot Initial States | Background Textures | Sensor Noise | Total |
| -------------- | --------------------- | ---------------- | ----------------- | -------------------- | ------------------- | ------------ | ----- |
| 0.7993         | 0.845                 | 0.900            | 0.885             | 0.405                | 0.753               | 0.944        | 0.790 |
Breakdown by task suite
| object | spatial | goal  | 10    | Total |
| ------ | ------- | ----- | ----- | ----- |
| 0.840  | 0.879   | 0.697 | 0.746 | 0.790 |
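A short script makes the two breakdowns easier to compare. The values are copied from the tables above; the variable names are illustrative. How the reported Total is aggregated is not stated, so the unweighted means computed here are only a sanity check.

```python
# Zero-shot SR by shift type, copied from the table above.
shift_sr = {
    "Objects Layout": 0.7993,
    "Language Instructions": 0.845,
    "Light Conditions": 0.900,
    "Camera Viewpoints": 0.885,
    "Robot Initial States": 0.405,
    "Background Textures": 0.753,
    "Sensor Noise": 0.944,
}

# Zero-shot SR by task suite, copied from the table above.
plus_suite_sr = {"object": 0.840, "spatial": 0.879, "goal": 0.697, "10": 0.746}

hardest_shift = min(shift_sr, key=shift_sr.get)
easiest_shift = max(shift_sr, key=shift_sr.get)

# Unweighted means; both land close to the reported Total of 0.790.
mean_over_shifts = sum(shift_sr.values()) / len(shift_sr)
mean_over_suites = sum(plus_suite_sr.values()) / len(plus_suite_sr)
```

Robot initial-state perturbations stand out as the weakest category by a wide margin (0.405), while sensor noise is the most benign (0.944).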
🎬 Qualitative Behavior
 Deliberate…