What does this model signal mean?

Xiaomi (MiMo) published XiaomiMiMo/MiMo-Embodied-7B. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license mit · 730 HF downloads · Xiaomi's 7B-parameter model for embodied AI tasks.. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Xiaomi (MiMo) Model: XiaomiMiMo/MiMo-Embodied-7B

Captured source

source ↗

Hugging Face/huggingface.co/XiaomiMiMo/MiMo-Embodied-7B

XiaomiMiMo/MiMo-Embodied-7B model card

Source ↗

published Nov 19, 2025seen Jun 6captured Jun 11http 200method plaintask image-text-to-textlicense mitlibrary transformersparams 8.3Bdownloads 730likes 70

I. Introduction

MiMo-Embodied, a powerful cross-embodied vision-language model that shows state-of-the-art performance in both autonomous driving and embodied AI tasks, the first open-source VLM that integrates these two critical areas, significantly enhancing understanding and reasoning in dynamic physical environments.

II. Model Capabilities

III. Model Details

IV. Evaluation Results

MiMo-Embodied demonstrates superior performance across 17 benchmarks in three key embodied AI capabilities: Task Planning, Affordance Prediction, and Spatial Understanding, significantly surpassing existing open-source embodied VLM models and rivaling closed-source models.

Additionally, MiMo-Embodied excels in 12 autonomous driving benchmarks across three key capabilities: Environmental Perception, Status Prediction, and Driving Planning—significantly outperforming both existing open-source and closed-source VLM models, as well as proprietary VLM models.

Moreover, evaluation on 8 general visual understanding benchmarks confirms that MiMo-Embodied retains and even strengthens its general capabilities, showing that domain-specialized training enhances rather than diminishes overall model proficiency.

Embodied AI Benchmarks

Affordance & Planning

Spatial Understanding

Autonomous Driving Benchmarks

Single-View Image & Multi-View Video

Multi-View Image & Single-View Video

General Visual Understanding Benchmarks

> Results marked with \* are obtained using our evaluation framework.

V. Case Visualization

Embodied AI

Affordance Prediction

Task Planning

Spatial Understanding

Autonomous Driving

Environmental Perception

Status Prediction

Driving Planning

Real-world Tasks

Embodied Navigation

Embodied Manipulation

VI. Citation

@misc{hao2025mimoembodiedxembodiedfoundationmodel,
title={MiMo-Embodied: X-Embodied Foundation Model Technical Report},
author={Xiaomi Embodied Intelligence Team},
year={2025},
eprint={2511.16518},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.16518},
}

Notability

notability 5.0/10

New embodied model release, moderate traction.