Tencent-Hunyuan/HY-Embodied-0.5-X
Python
Captured source
source ↗Tencent-Hunyuan/HY-Embodied-0.5-X
Description: HY-Embodied-0.5-X: An Enhanced Embodied Foundation Model for Real-World Agents
Language: Python
License: NOASSERTION
Stars: 53
Forks: 4
Open issues: 0
Created: 2026-04-22T13:21:36Z
Pushed: 2026-05-14T03:31:55Z
Default branch: master
Fork: no
Archived: no
README:
HY-Embodied-0.5-X
An Enhanced Embodied Foundation Model for Real-World Agents
*Tencent Robotics X × HY Vision Team*
---
HY-Embodied-0.5-X is an enhanced open-source embodied multimodal foundation model jointly released by Tencent Robotics X and the HY Vision Team. Built on top of the HY-Embodied-0.5 MoT-2B architecture (4B total parameters with only 2B activated), it is specifically optimized for the core loop of real-world robotics — "understand, reason, and act".
The model reaches state-of-the-art performance on 10 mainstream embodied task-planning benchmarks, ranking 1st among edge-side domain models on 7 of them. Compared with general-purpose multimodal models, HY-Embodied-0.5-X focuses more tightly on the problems that matter in real-world robot interaction, with dedicated improvements in fine-grained manipulation understanding, spatial reasoning, action prediction, risk assessment, multimodal reference grounding, and long-horizon planning — pushing the model from *"seeing"* to *"doing"*.
🔥 Updates
- `[2026-04-24]` 🚀 Released HY-Embodied-0.5-X, an embodied-focused
enhancement on top of HY-Embodied-0.5 MoT-2B, together with inference and training code.
⭐️ Key Features
1. 🧠 Stronger Spatial Understanding — accurately reasons about object positions, scene layout, relative spatial relations, and manipulation states, providing a reliable perceptual basis for action decisions. 2. 🔗 Stronger Long-Horizon Planning — handles multi-step, strongly-dependent complex tasks, producing stable task decomposition, action planning, and execution decisions across continuous interactions. 3. 🤖 Stronger Embodied Interaction — beyond visual understanding and dialogue, supports task parsing, reference resolution, action decisions, risk judgement, and failure reflection, closely matching the real robot interaction loop. 4. 📦 Edge-Friendly — built on the MoT-2B architecture (4B total / 2B activated), suitable for on-device deployment and real-time response.
📖 Model Highlights
1. Rich and Reliable Data Composition
HY-Embodied-0.5-X combines self-collected first-person robot manipulation data, robotic-arm manipulation data, and open-source embodied data into a high-quality corpus that covers manipulation understanding, first-person task reasoning, and multimodal reference grounding:
- Robotic-arm / human-hand trajectories — dedicated data for state
understanding, next-action prediction, manipulation-risk assessment, failure diagnosis, and pairwise candidate-action comparison.
- First-person embodied tasks — fine-grained action recognition,
subtask progress estimation, hand spatial localization, depth estimation, relative spatial reasoning, camera pose inference, and more.
- Multimodal interactive reference grounding — data built around
ambiguous real-world instructions such as *"put this over there"*, combining speech and gesture cues.
All core samples are paired with chain-of-thought (CoT) annotations and a full "generate → verify → correct → eval-regression" data-quality loop. Embodied, internet, and 3D data are further unified through a standardized reconstruction pipeline that turns heterogeneous sources into consistent, high-quality embodied reasoning data.
2. "Validate → Scale → Full-Run" Training Strategy
Training follows a staged iterative strategy:
1. Quickly validate training configs and data cleaning on a small, high-quality subset. 2. Progressively scale up training data and compute. 3. Kick off full-scale training only after the optimal data mix and training strategy are confirmed.
This ensures each unit of compute is invested in the most valuable data.
📊 Evaluation
Overall Benchmark Results
Across 10 open-source benchmarks covering planning, spatial reasoning, embodied QA, visual reference, and trajectory understanding, HY-Embodied-0.5-X stays in the top tier.
Comparison with Same-Size Open-Source Models
AI2Thor Embodied Planning Benchmark
We built an internal embodied-planning benchmark on AI2Thor with 1,011 tasks across four household scenes (kitchen, bedroom, living room, bathroom), evaluating planning and execution on navigation, grasping, placement, appliance operation, and food cutting. HY-Embodied-0.5-X shows clear gains on long-horizon manipulation, self-awareness, and spatial understanding:
PlaygroundX Simulation Integration
HY-Embodied-0.5-X is integrated with the PlaygroundX simulation framework (built on Tairos). It produces full plans for household instructions such as *"throw the potato into the trash"*, *"close the fridge door"*, or *"put the tomato in the fridge"*, and adjusts execution based on environmental feedback — including on-the-fly replanning when an initial plan fails, forming a complete ReAct loop: *reason → execute → detect failure → replan*.
🛠️ Installation
A one-click conda setup script setup_env.sh is provided. It creates the environment, installs PyTorch / flash_attn / transformers (native HY-Embodied support) / and all remaining dependencies. flash_attn compiles from source and takes ~10–20 minutes:
bash setup_env.sh conda activate hy_embodied_x # (optional) expose the package as a console script pip install -e .
Prerequisites
| Item | Requirement | |---------|---------------------------------| | OS | Linux | | Python | 3.12 | | CUDA | 12.6 | | PyTorch | 2.10.0 | | GPU | NVIDIA GPU with ≥ 16 GB VRAM |
> Key dependencies: transformers (specific commit, native HY-Embodied support), > flash_attn==2.8.3, accelerate, deepspeed, timm, liger-kernel. > See setup_env.sh and requirements.txt for the pinned list.
📥 Downloading the Weights
hf download tencent/HY-Embodied-0.5-X \ --local-dir ckpts/HY-Embodied-0.5-X
Weights (*.safetensors) are git-ignored and expected under ckpts/HY-Embodied-0.5-X/. The inference and training code also accepts the Hub repo id directly, which triggers on-demand download via transformers.…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10Routine repo with low stars