RepoTencent HunyuanTencent Hunyuanpublished Apr 22, 2026seen 5d

Tencent-Hunyuan/HY-Embodied-0.5-X

Python

Open original ↗

Captured source

source ↗

Tencent-Hunyuan/HY-Embodied-0.5-X

Description: HY-Embodied-0.5-X: An Enhanced Embodied Foundation Model for Real-World Agents

Language: Python

License: NOASSERTION

Stars: 53

Forks: 4

Open issues: 0

Created: 2026-04-22T13:21:36Z

Pushed: 2026-05-14T03:31:55Z

Default branch: master

Fork: no

Archived: no

README:

HY-Embodied-0.5-X

An Enhanced Embodied Foundation Model for Real-World Agents

*Tencent Robotics X × HY Vision Team*

---

HY-Embodied-0.5-X is an enhanced open-source embodied multimodal foundation model jointly released by Tencent Robotics X and the HY Vision Team. Built on top of the HY-Embodied-0.5 MoT-2B architecture (4B total parameters with only 2B activated), it is specifically optimized for the core loop of real-world robotics — "understand, reason, and act".

The model reaches state-of-the-art performance on 10 mainstream embodied task-planning benchmarks, ranking 1st among edge-side domain models on 7 of them. Compared with general-purpose multimodal models, HY-Embodied-0.5-X focuses more tightly on the problems that matter in real-world robot interaction, with dedicated improvements in fine-grained manipulation understanding, spatial reasoning, action prediction, risk assessment, multimodal reference grounding, and long-horizon planning — pushing the model from *"seeing"* to *"doing"*.

🔥 Updates

  • `[2026-04-24]` 🚀 Released HY-Embodied-0.5-X, an embodied-focused

enhancement on top of HY-Embodied-0.5 MoT-2B, together with inference and training code.

⭐️ Key Features

1. 🧠 Stronger Spatial Understanding — accurately reasons about object positions, scene layout, relative spatial relations, and manipulation states, providing a reliable perceptual basis for action decisions. 2. 🔗 Stronger Long-Horizon Planning — handles multi-step, strongly-dependent complex tasks, producing stable task decomposition, action planning, and execution decisions across continuous interactions. 3. 🤖 Stronger Embodied Interaction — beyond visual understanding and dialogue, supports task parsing, reference resolution, action decisions, risk judgement, and failure reflection, closely matching the real robot interaction loop. 4. 📦 Edge-Friendly — built on the MoT-2B architecture (4B total / 2B activated), suitable for on-device deployment and real-time response.

📖 Model Highlights

1. Rich and Reliable Data Composition

HY-Embodied-0.5-X combines self-collected first-person robot manipulation data, robotic-arm manipulation data, and open-source embodied data into a high-quality corpus that covers manipulation understanding, first-person task reasoning, and multimodal reference grounding:

  • Robotic-arm / human-hand trajectories — dedicated data for state

understanding, next-action prediction, manipulation-risk assessment, failure diagnosis, and pairwise candidate-action comparison.

  • First-person embodied tasks — fine-grained action recognition,

subtask progress estimation, hand spatial localization, depth estimation, relative spatial reasoning, camera pose inference, and more.

  • Multimodal interactive reference grounding — data built around

ambiguous real-world instructions such as *"put this over there"*, combining speech and gesture cues.

All core samples are paired with chain-of-thought (CoT) annotations and a full "generate → verify → correct → eval-regression" data-quality loop. Embodied, internet, and 3D data are further unified through a standardized reconstruction pipeline that turns heterogeneous sources into consistent, high-quality embodied reasoning data.

2. "Validate → Scale → Full-Run" Training Strategy

Training follows a staged iterative strategy:

1. Quickly validate training configs and data cleaning on a small, high-quality subset. 2. Progressively scale up training data and compute. 3. Kick off full-scale training only after the optimal data mix and training strategy are confirmed.

This ensures each unit of compute is invested in the most valuable data.

📊 Evaluation

Overall Benchmark Results

Across 10 open-source benchmarks covering planning, spatial reasoning, embodied QA, visual reference, and trajectory understanding, HY-Embodied-0.5-X stays in the top tier.

Comparison with Same-Size Open-Source Models

AI2Thor Embodied Planning Benchmark

We built an internal embodied-planning benchmark on AI2Thor with 1,011 tasks across four household scenes (kitchen, bedroom, living room, bathroom), evaluating planning and execution on navigation, grasping, placement, appliance operation, and food cutting. HY-Embodied-0.5-X shows clear gains on long-horizon manipulation, self-awareness, and spatial understanding:

PlaygroundX Simulation Integration

HY-Embodied-0.5-X is integrated with the PlaygroundX simulation framework (built on Tairos). It produces full plans for household instructions such as *"throw the potato into the trash"*, *"close the fridge door"*, or *"put the tomato in the fridge"*, and adjusts execution based on environmental feedback — including on-the-fly replanning when an initial plan fails, forming a complete ReAct loop: *reason → execute → detect failure → replan*.

🛠️ Installation

A one-click conda setup script setup_env.sh is provided. It creates the environment, installs PyTorch / flash_attn / transformers (native HY-Embodied support) / and all remaining dependencies. flash_attn compiles from source and takes ~10–20 minutes:

bash setup_env.sh
conda activate hy_embodied_x

# (optional) expose the package as a console script
pip install -e .

Prerequisites

| Item | Requirement | |---------|---------------------------------| | OS | Linux | | Python | 3.12 | | CUDA | 12.6 | | PyTorch | 2.10.0 | | GPU | NVIDIA GPU with ≥ 16 GB VRAM |

> Key dependencies: transformers (specific commit, native HY-Embodied support), > flash_attn==2.8.3, accelerate, deepspeed, timm, liger-kernel. > See setup_env.sh and requirements.txt for the pinned list.

📥 Downloading the Weights

hf download tencent/HY-Embodied-0.5-X \
--local-dir ckpts/HY-Embodied-0.5-X

Weights (*.safetensors) are git-ignored and expected under ckpts/HY-Embodied-0.5-X/. The inference and training code also accepts the Hub repo id directly, which triggers on-demand download via transformers.…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine repo with low stars