Repo · Tencent Hunyuan · published Jun 12, 2026

Tencent-Hunyuan/Hy-Embodied-0.5-VLA


Language: Python

License: NOASSERTION

Stars: 5

Forks: 0

Open issues: 0

Created: 2026-06-12T03:47:21Z

Pushed: 2026-06-15T03:33:09Z

Default branch: main

Fork: no

Archived: no

README:

https://github.com/user-attachments/assets/e472c495-6fa6-4171-ae00-418bdd97473b

🔥 Updates

📖 Abstract

We introduce Hy-Embodied-0.5-VLA (Hy-VLA) — an end-to-end Vision-Language-Action system that spans the full robot learning stack: data collection, model design, pre-training, supervised fine-tuning, RL post-training, and real-world deployment. Built on the Hy-Embodied-0.5 MoT backbone, Hy-VLA integrates a flow-matching action expert, a compact memory encoder for multi-frame history, and a delta-chunk action representation decoupled from embodiment-specific kinematics.
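
To make the delta-chunk idea concrete, here is a minimal sketch in plain Python. It assumes that "relative to the current frame" means each future end-effector pose is re-expressed in the coordinate frame of the current end-effector pose; the source does not spell out the exact convention, so treat this as one plausible reading. The helper names (`to_delta_chunk`, `from_delta_chunk`, `rot_z`) are hypothetical and are not part of the `hy_vla` package.

```python
import math


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def rot_z(theta):
    """Rotation matrix about the z axis (used only to build example poses)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def to_delta_chunk(current, future):
    """Express future poses relative to the current pose.

    current: (R0, p0) in the robot base frame.
    future:  list of (Ri, pi) in the same base frame.
    Returns a list of (R0^T Ri, R0^T (pi - p0)).
    """
    r0, p0 = current
    r0_t = transpose(r0)
    chunk = []
    for r_i, p_i in future:
        diff = [a - b for a, b in zip(p_i, p0)]
        chunk.append((matmul(r0_t, r_i), matvec(r0_t, diff)))
    return chunk


def from_delta_chunk(current, chunk):
    """Inverse of to_delta_chunk: recover base-frame poses on a given robot."""
    r0, p0 = current
    poses = []
    for r_rel, p_rel in chunk:
        p = [a + b for a, b in zip(p0, matvec(r0, p_rel))]
        poses.append((matmul(r0, r_rel), p))
    return poses
```

Because the chunk only encodes motion relative to the current end-effector pose, the same predicted chunk can be replayed on a different arm by calling `from_delta_chunk` with that arm's current pose; joint-level kinematics never enter the representation.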

Powered by 10,000+ hours of high-fidelity UMI demonstrations collected via a custom fingertip interface with optical motion-capture, Hy-VLA achieves state-of-the-art results on the RoboTwin 2.0 benchmark (90.9% / 90.1% on Clean / Randomized) and demonstrates robust cross-embodiment transfer across four real-world robot platforms. Paired with FlowPRO preference optimization and an asynchronous inference framework, Hy-VLA establishes a scalable paradigm for continuous dexterous manipulation.

⭐ Key Features

  • 🧠 Unified VLA Architecture: Extends the Hy-Embodied-0.5 MoT backbone with a dual-tower flow-matching action expert. The VLM tower handles vision-language understanding while the action expert generates continuous action chunks — all tied together through shared cross-modal attention.
  • 🎯 Delta-Chunk Action Representation: Actions are predicted as relative-to-current-frame end-effector delta chunks, decoupling the policy from embodiment-specific kinematics and enabling seamless cross-embodiment transfer.
  • 📹 Compact Memory Encoder: A parameter-free temporal-spatial attention mechanism interleaved within the ViT encoder compresses K-frame multi-view history into current-frame tokens, preserving temporal context without inflating the token budget.
  • 📊 Hy-UMI-10K Dataset: 10K+ hours of sub-millimeter precision dual-arm demonstrations across 70+ tasks, collected with a custom fingertip UMI rig tracked by an optical motion-capture system. 2K+ hours are publicly released.
  • 🚀 FlowPRO Post-Training *(under review)*: A critic-free preference optimization algorithm that converts real-robot failure interventions into rapid policy improvement without reward models.
  • Asynchronous Deployment Stack: Producer-consumer inference with cubic Bézier chunk stitching enables high-frequency closed-loop control across heterogeneous robot platforms.
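
The chunk-stitching step in the last bullet can be sketched as follows. This is an illustrative reconstruction, not the repository's implementation: it assumes the old and new chunks are time-aligned lists of waypoints (index `k` in both refers to the same control tick), estimates velocities by finite differences, and uses the standard Hermite-to-Bézier conversion to pick control points so that position and velocity are continuous at both ends of the blend. The function names and the `blend_steps` parameter are hypothetical.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at t in [0, 1], per dimension."""
    u = 1.0 - t
    return [
        u**3 * a + 3 * u * u * t * b + 3 * u * t * t * c + t**3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    ]


def stitch_chunks(old_tail, new_chunk, blend_steps):
    """Blend from the chunk being executed into a freshly predicted chunk.

    old_tail:    remaining waypoints of the old chunk, starting at the
                 current tick (at least two waypoints).
    new_chunk:   new waypoints, assumed aligned to the same ticks.
    blend_steps: number of ticks spent on the Bezier transition.
    """
    if len(old_tail) < 2:
        raise ValueError("need at least two old waypoints to estimate velocity")
    if not 1 <= blend_steps < len(new_chunk):
        raise ValueError("blend_steps must be in [1, len(new_chunk) - 1]")

    n = blend_steps
    p0, p3 = old_tail[0], new_chunk[n]

    # Per-tick velocities at the start (old chunk) and end (new chunk).
    v0 = [b - a for a, b in zip(old_tail[0], old_tail[1])]
    if n + 1 < len(new_chunk):
        v1 = [b - a for a, b in zip(new_chunk[n], new_chunk[n + 1])]
    else:
        v1 = [b - a for a, b in zip(new_chunk[n - 1], new_chunk[n])]

    # Hermite-to-Bezier: the curve spans n ticks, so scale velocities by n / 3.
    p1 = [p + v * n / 3.0 for p, v in zip(p0, v0)]
    p2 = [p - v * n / 3.0 for p, v in zip(p3, v1)]

    blend = [cubic_bezier(p0, p1, p2, p3, k / n) for k in range(n + 1)]
    return blend + [list(w) for w in new_chunk[n + 1:]]
```

In a producer-consumer setup, the consumer (control loop) would call something like `stitch_chunks` whenever the producer (model inference) delivers a new chunk, so the commanded trajectory does not jump at the hand-off.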

📦 Repository Contents

Hy-Embodied-0.5-VLA/
├── hy_vla/                         # Core model definition, training, and inference
│   ├── modeling_hy_vla.py          # HyVLA model class
│   ├── modeling_dual_tower.py      # Dual-tower transformer (VLM + action expert)
│   ├── configuration_hy_vla.py     # Model configuration
│   ├── space_time_attention.py     # Temporal-spatial attention for memory encoder
│   ├── data/                       # Dataloader and dataset utilities
│   ├── config/                     # YAML configuration files
│   └── hunyuan_vl_mot/             # Vendored Hy-Embodied VLM backbone (fallback)
├── scripts/                        # Training, evaluation, and preprocessing scripts
│   ├── quick_start.py              # Fast smoke-test for a released checkpoint
│   ├── train_umi_vlm.sh            # Stage-1 pre-training launcher
│   ├── train_robotwin_vlm.sh       # Stage-2 SFT from VLM backbone
│   ├── train_robotwin_umi.sh       # Stage-2 SFT from UMI pretrain
│   ├── train_table_vlm.sh          # Single-table fast-iteration training
│   ├── eval_robotwin_test.sh       # Quick RoboTwin regression (6 tasks)
│   ├── eval_robotwin_full.sh       # Full RoboTwin sweep (50 tasks × 100 rollouts)
│   ├── compute_norm_lance.py       # Pre-compute norm stats from Lance data
│   ├── compute_norm_hdf5.py        # Pre-compute norm stats from HDF5 data
│   └── vis_umi_episode.py          # Render an episode as MP4
├── robotwin_eval/                  # RoboTwin adapter for evaluation
├── assets/                         # Example data and index files
└── pyproject.toml                  # Python project configuration (uv/pip)
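
The two `compute_norm_*` scripts pre-compute normalization statistics, and the Quick Start feeds the model a normalized state vector. The source does not say which statistics are used, so the following is only a generic sketch of one common choice: per-dimension mean and standard deviation accumulated in a single streaming pass (Welford's algorithm), which avoids loading the whole dataset into memory. The class and function names are hypothetical.

```python
import math


class RunningStats:
    """Streaming per-dimension mean / std via Welford's algorithm."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        for i, v in enumerate(x):
            delta = v - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (v - self.mean[i])

    def std(self):
        # Population standard deviation; zero if nothing has been seen yet.
        if self.n == 0:
            return [0.0] * len(self.m2)
        return [math.sqrt(m / self.n) for m in self.m2]


def normalize(x, mean, std, eps=1e-8):
    """Shift and scale a vector; eps guards against zero-variance dimensions."""
    return [(v - m) / (s + eps) for v, m, s in zip(x, mean, std)]
```

A dataset loop would call `update` once per state or action vector and store `mean` and `std()` alongside the training data; the repository's scripts may instead use other statistics such as min/max or quantiles.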

🛠️ Installation

Prerequisites

  • 🖥️ OS: Linux (recommended)
  • 🐍 Python: 3.12 (recommended and tested)
  • CUDA: 12.x
  • 🔥 PyTorch: ≥ 2.4
  • 🎮 GPU: NVIDIA GPU with CUDA support (≥ 16 GB VRAM recommended)
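
Before installing, it can help to check the current machine against the list above. The snippet below is a small convenience sketch rather than something shipped with the repository; `check_environment` is a hypothetical helper. It only reports values and leaves the comparison against the prerequisites to the reader, and it degrades gracefully if PyTorch is not installed yet.

```python
import sys


def check_environment():
    """Collect the Python / PyTorch / CUDA / GPU facts the prerequisites mention."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        # 3.12 is the version the README lists as recommended and tested.
        "python_matches_tested": sys.version_info[:2] == (3, 12),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch not installed yet
        return report

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["gpu"] = props.name
        report["vram_gib"] = round(props.total_memory / 1024**3, 1)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```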

Install via uv (recommended)

git clone https://github.com/Tencent-Hunyuan/Hy-Embodied-0.5-VLA
cd Hy-Embodied-0.5-VLA

# One-off: install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Materialize the virtual environment
uv sync

Install via pip

pip install -r requirements.txt

> Note: Hy-VLA depends on an upstream transformers fork that supports the Hy-Embodied MoT backbone. The pinned commit is specified in both requirements.txt and pyproject.toml. If the fork URL is unreachable, the verbatim vendored copy at hy_vla/hunyuan_vl_mot/ serves as a fallback.

🚀 Quick Start

The fastest way to verify a fresh install is the bundled smoke test:

import torch
from huggingface_hub import snapshot_download
from hy_vla import HyVLA, HyVLAConfig

ckpt = snapshot_download("tencent/Hy-Embodied-0.5-VLA-RoboTwin")

config = HyVLAConfig.from_pretrained(ckpt)
policy = HyVLA.from_pretrained(ckpt, config=config)
policy.enable_video_encoder_if_needed()
policy = policy.to(device="cuda", dtype=torch.bfloat16).eval()

# (B, K, C, H, W); K=6 history slots
img = torch.zeros(1, 6, 3, 224, 224, device="cuda", dtype=torch.bfloat16)
# Normalized dual-arm EEF: [xyz(3) + rot6d(6) + gripper(1)] * 2
state = torch.zeros((1, config.max_state_dim), device="cuda", dtype=torch.bfloat16)
batch = {
    "observation.images.top_head": img,
    "observation.images.hand_left": img,
    "observation.images.hand_right": img,
    "observation.state": state,
    "task": ["pick up the bottle"],
}

with torch.no_grad():
    actions = policy.forward_evaluate(batch)["pred"]
actions = actions[..., :...

