Tencent-Hunyuan/Hy-Embodied-0.5-VLA
Language: Python
License: NOASSERTION
Stars: 5
Forks: 0
Open issues: 0
Created: 2026-06-12T03:47:21Z
Pushed: 2026-06-15T03:33:09Z
Default branch: main
Fork: no
Archived: no
README:
https://github.com/user-attachments/assets/e472c495-6fa6-4171-ae00-418bdd97473b
🔥 Updates
- `[2026-06-15]` 🚀 We have released Hy-Embodied-0.5-VLA — including the codebase, the `Hy-Embodied-0.5-VLA-UMI` and `Hy-Embodied-0.5-VLA-RoboTwin` models, and the `Hy-Embodied-0.5-VLA-Data` egocentric UMI dataset (2K+ hours)!
📖 Abstract
We introduce Hy-Embodied-0.5-VLA (Hy-VLA) — an end-to-end Vision-Language-Action system that spans the full robot learning stack: data collection, model design, pre-training, supervised fine-tuning, RL post-training, and real-world deployment. Built on the Hy-Embodied-0.5 MoT backbone, Hy-VLA integrates a flow-matching action expert, a compact memory encoder for multi-frame history, and a delta-chunk action representation decoupled from embodiment-specific kinematics.
Powered by 10,000+ hours of high-fidelity UMI demonstrations collected via a custom fingertip interface with optical motion-capture, Hy-VLA achieves state-of-the-art results on the RoboTwin 2.0 benchmark (90.9% / 90.1% on Clean / Randomized) and demonstrates robust cross-embodiment transfer across four real-world robot platforms. Paired with FlowPRO preference optimization and an asynchronous inference framework, Hy-VLA establishes a scalable paradigm for continuous dexterous manipulation.
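The flow-matching action expert mentioned in the abstract produces an action chunk by integrating a learned velocity field from noise toward an action. The sketch below shows that sampling loop in generic form with fixed-step Euler integration. Everything in it is an illustrative assumption rather than the repository's code: `euler_sample`, `make_toy_velocity`, the time convention (noise at t = 0, action at t = 1), and the step count are hypothetical, and the toy velocity field stands in for the real network, which would be conditioned on images, state, and the task instruction.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    dim: int,
    num_steps: int = 10,
    seed: int = 0,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (sample)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by 0
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned field: straight-line flow to `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    action = euler_sample(make_toy_velocity([0.5, -0.25, 1.0]), dim=3)
    print(action)
```

In a real policy the number of integration steps trades inference latency against sample quality, which is one reason chunked prediction pairs naturally with asynchronous execution.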
⭐ Key Features
- 🧠 Unified VLA Architecture: Extends the Hy-Embodied-0.5 MoT backbone with a dual-tower flow-matching action expert. The VLM tower handles vision-language understanding while the action expert generates continuous action chunks — all tied together through shared cross-modal attention.
- 🎯 Delta-Chunk Action Representation: Actions are predicted as relative-to-current-frame end-effector delta chunks, decoupling the policy from embodiment-specific kinematics and enabling seamless cross-embodiment transfer.
- 📹 Compact Memory Encoder: A parameter-free temporal-spatial attention mechanism interleaved within the ViT encoder compresses K-frame multi-view history into current-frame tokens, preserving temporal context without inflating the token budget.
- 📊 Hy-UMI-10K Dataset: 10K+ hours of sub-millimeter precision dual-arm demonstrations across 70+ tasks, collected with a custom fingertip UMI rig tracked by an optical motion-capture system. 2K+ hours are publicly released.
- 🚀 FlowPRO Post-Training *(under review)*: A critic-free preference optimization algorithm that converts real-robot failure interventions into rapid policy improvement without reward models.
- ⚡ Asynchronous Deployment Stack: Producer-consumer inference with cubic Bézier chunk stitching enables high-frequency closed-loop control across heterogeneous robot platforms.
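To make the delta-chunk idea from the feature list concrete, the sketch below expresses a sequence of future end-effector poses relative to the current pose, and then recovers absolute poses from such a chunk. It is a minimal illustration under stated assumptions: poses are (position, 3x3 rotation matrix) pairs, deltas are expressed in the current end-effector frame, and the names `to_delta_chunk`, `from_delta_chunk`, and `rot_z` are hypothetical. The README does not specify the exact frame convention the model uses.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = List[float]
Mat3 = List[List[float]]
Pose = Tuple[Vec3, Mat3]  # (position, rotation matrix)


def transpose(m: Mat3) -> Mat3:
    return [[m[j][i] for j in range(3)] for i in range(3)]


def matmul(a: Mat3, b: Mat3) -> Mat3:
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(m: Mat3, v: Vec3) -> Vec3:
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def rot_z(theta: float) -> Mat3:
    """Rotation about the z axis, used only to build example poses."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def to_delta_chunk(current: Pose, future: Sequence[Pose]) -> List[Pose]:
    """Express each future pose in the frame of the current pose (assumed convention)."""
    p0, r0 = current
    r0_t = transpose(r0)
    chunk = []
    for p, r in future:
        dp = matvec(r0_t, [p[i] - p0[i] for i in range(3)])  # translation in current frame
        dr = matmul(r0_t, r)                                  # relative rotation
        chunk.append((dp, dr))
    return chunk


def from_delta_chunk(current: Pose, chunk: Sequence[Pose]) -> List[Pose]:
    """Invert to_delta_chunk: turn relative poses back into absolute targets."""
    p0, r0 = current
    poses = []
    for dp, dr in chunk:
        offset = matvec(r0, dp)
        poses.append(([p0[i] + offset[i] for i in range(3)], matmul(r0, dr)))
    return poses
```

Because the chunk depends only on relative end-effector motion, the same chunk can be replayed from a different starting pose with `from_delta_chunk`, which is the property that keeps the representation independent of any one robot's joint kinematics.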
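The cubic Bézier chunk stitching named in the deployment bullet can be pictured as follows: when a freshly predicted chunk arrives while an older one is still executing, a short Bézier bridge joins the two so that position and per-step velocity stay continuous. This is a sketch under assumptions, not the repository's implementation. The function names, the choice of `blend_steps`, and the use of finite differences for the end tangents are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def cubic_bezier(p0: Vector, p1: Vector, p2: Vector, p3: Vector, s: float) -> Vector:
    """Evaluate a cubic Bézier curve at parameter s in [0, 1]."""
    a = (1 - s) ** 3
    b = 3 * (1 - s) ** 2 * s
    c = 3 * (1 - s) * s ** 2
    d = s ** 3
    return [a * w + b * x + c * y + d * z for w, x, y, z in zip(p0, p1, p2, p3)]


def stitch_chunks(
    old_remaining: Sequence[Vector],
    new_chunk: Sequence[Vector],
    blend_steps: int,
) -> List[Vector]:
    """Bridge from old_remaining[0] to new_chunk[blend_steps], then follow new_chunk."""
    if blend_steps < 1 or len(old_remaining) < 2 or len(new_chunk) <= blend_steps:
        raise ValueError("need >=2 old actions and more new actions than blend_steps")
    p0 = list(old_remaining[0])
    p3 = list(new_chunk[blend_steps])
    # Per-step velocities at both ends, estimated by finite differences.
    v0 = [b - a for a, b in zip(old_remaining[0], old_remaining[1])]
    v1 = [b - a for a, b in zip(new_chunk[blend_steps - 1], new_chunk[blend_steps])]
    # Hermite-to-Bézier conversion: inner control points sit one third of the
    # blend duration along each end tangent.
    scale = blend_steps / 3.0
    p1 = [p + scale * v for p, v in zip(p0, v0)]
    p2 = [p - scale * v for p, v in zip(p3, v1)]
    bridge = [cubic_bezier(p0, p1, p2, p3, k / blend_steps) for k in range(blend_steps)]
    return bridge + [list(a) for a in new_chunk[blend_steps:]]
```

In a producer-consumer setup, the inference thread would hand over `new_chunk` and the control loop would call something like `stitch_chunks` once before continuing to consume actions at its own rate.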
📦 Repository Contents
Hy-Embodied-0.5-VLA/
├── hy_vla/                        # Core model definition, training, and inference
│   ├── modeling_hy_vla.py         # HyVLA model class
│   ├── modeling_dual_tower.py     # Dual-tower transformer (VLM + action expert)
│   ├── configuration_hy_vla.py    # Model configuration
│   ├── space_time_attention.py    # Temporal-spatial attention for memory encoder
│   ├── data/                      # Dataloader and dataset utilities
│   ├── config/                    # YAML configuration files
│   └── hunyuan_vl_mot/            # Vendored Hy-Embodied VLM backbone (fallback)
├── scripts/                       # Training, evaluation, and preprocessing scripts
│   ├── quick_start.py             # Fast smoke-test for a released checkpoint
│   ├── train_umi_vlm.sh           # Stage-1 pre-training launcher
│   ├── train_robotwin_vlm.sh      # Stage-2 SFT from VLM backbone
│   ├── train_robotwin_umi.sh      # Stage-2 SFT from UMI pretrain
│   ├── train_table_vlm.sh         # Single-table fast-iteration training
│   ├── eval_robotwin_test.sh      # Quick RoboTwin regression (6 tasks)
│   ├── eval_robotwin_full.sh      # Full RoboTwin sweep (50 tasks × 100 rollouts)
│   ├── compute_norm_lance.py      # Pre-compute norm stats from Lance data
│   ├── compute_norm_hdf5.py       # Pre-compute norm stats from HDF5 data
│   └── vis_umi_episode.py         # Render an episode as MP4
├── robotwin_eval/                 # RoboTwin adapter for evaluation
├── assets/                        # Example data and index files
└── pyproject.toml                 # Python project configuration (uv/pip)
🛠️ Installation
Prerequisites
- 🖥️ OS: Linux (recommended)
- 🐍 Python: 3.12 (recommended and tested)
- ⚡ CUDA: 12.x
- 🔥 PyTorch: ≥ 2.4
- 🎮 GPU: NVIDIA GPU with CUDA support (≥ 16 GB VRAM recommended)
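The prerequisites above can be checked before installing anything else. The following is a small, hypothetical helper (not part of the repository) that reports the Python version and, if PyTorch is already installed, its version, CUDA build, and available GPU memory. The thresholds mirror the list above, and the function names are assumptions.

```python
import importlib.util
import sys
from typing import Any, Dict, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Turn a version string such as '2.4.0+cu121' into (2, 4)."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def check_environment() -> Dict[str, Any]:
    """Collect a report on the listed prerequisites without raising on mismatch."""
    report: Dict[str, Any] = {
        "python": tuple(sys.version_info[:2]),
        "python_is_3_12": tuple(sys.version_info[:2]) == (3, 12),
        "torch_found": importlib.util.find_spec("torch") is not None,
    }
    if not report["torch_found"]:
        return report

    import torch  # imported lazily so the check also runs before PyTorch is installed

    report["torch_ok"] = parse_major_minor(torch.__version__) >= (2, 4)
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        report["vram_gib"] = round(total_bytes / 2**30, 1)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```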
Install via uv (recommended)
git clone https://github.com/Tencent-Hunyuan/Hy-Embodied-0.5-VLA
cd Hy-Embodied-0.5-VLA

# One-off: install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Materialize the virtual environment
uv sync
Install via pip
pip install -r requirements.txt
> Note: Hy-VLA depends on an upstream transformers fork that supports the Hy-Embodied MoT backbone. The pinned commit is specified in both requirements.txt and pyproject.toml. If the fork URL is unreachable, a verbatim vendor copy at hy_vla/hunyuan_vl_mot/ serves as fallback.
🚀 Quick Start
The fastest way to verify a fresh install is the bundled smoke test:
import torch
from huggingface_hub import snapshot_download
from hy_vla import HyVLA, HyVLAConfig
ckpt = snapshot_download("tencent/Hy-Embodied-0.5-VLA-RoboTwin")
config = HyVLAConfig.from_pretrained(ckpt)
policy = HyVLA.from_pretrained(ckpt, config=config)
policy.enable_video_encoder_if_needed()
policy = policy.to(device="cuda", dtype=torch.bfloat16).eval()
# (B, K, C, H, W); K=6 history slots
img = torch.zeros(1, 6, 3, 224, 224, device="cuda", dtype=torch.bfloat16)
# Normalized dual-arm EEF: [xyz(3) + rot6d(6) + gripper(1)] * 2
state = torch.zeros((1, config.max_state_dim), device="cuda", dtype=torch.bfloat16)
batch = {
    "observation.images.top_head": img,
    "observation.images.hand_left": img,
    "observation.images.hand_right": img,
    "observation.state": state,
    "task": ["pick up the bottle"],
}
with torch.no_grad():
    actions = policy.forward_evaluate(batch)["pred"]
actions = actions[..., :...
(Excerpt truncated here; see the source repository for the full document.)
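The state comment in the snippet above describes a dual-arm layout of `[xyz(3) + rot6d(6) + gripper(1)] * 2`, which is 20 values. The sketch below shows one way to unpack such a vector and turn the 6D rotation into a rotation matrix by Gram-Schmidt orthonormalization. Several points are assumptions: the left-arm-first ordering, the convention that the six numbers are the first two columns of the rotation matrix, and the helper names `split_dual_arm` and `rot6d_to_matrix`. None of these is confirmed by the README, and `config.max_state_dim` may be larger than 20 because of padding.

```python
import math
from typing import Any, Dict, List, Sequence

ARM_DIM = 10  # xyz(3) + rot6d(6) + gripper(1), per the quick-start comment


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-8:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / norm for x in v]


def rot6d_to_matrix(r6: Sequence[float]) -> List[List[float]]:
    """Gram-Schmidt: build an orthonormal frame from two 3-vectors (assumed columns)."""
    a1, a2 = list(r6[:3]), list(r6[3:6])
    b1 = _normalize(a1)
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    b3 = [  # cross product b1 x b2 completes a right-handed frame
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def split_dual_arm(vec: Sequence[float]) -> Dict[str, Dict[str, Any]]:
    """Split the first 20 values into left/right arm fields (ordering is assumed)."""
    if len(vec) < 2 * ARM_DIM:
        raise ValueError("expected at least 20 values")
    arms: Dict[str, Dict[str, Any]] = {}
    for idx, name in enumerate(("left", "right")):
        part = list(vec[idx * ARM_DIM:(idx + 1) * ARM_DIM])
        arms[name] = {"xyz": part[:3], "rot6d": part[3:9], "gripper": part[9]}
    return arms
```

Note that the all-zero state used in the smoke test is not a valid rotation, so `rot6d_to_matrix` would raise on it. The zeros there only exercise tensor shapes.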