Model: Xiaomi (MiMo)

XiaomiMiMo/MiMo-VL-7B-RL-2508

published: Aug 7, 2025 · task: image-text-to-text · license: mit · library: transformers · params: 8.3B · downloads: 1.3k · likes: 92

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

MiMo-VL Technical Report

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🔥🔥🔥MiMo-VL 2508 Updates

We're excited to announce improvements to our MiMo-VL models (MiMo-VL-7B-RL-2508 and MiMo-VL-7B-SFT-2508), featuring enhanced performance across multiple benchmarks, improved thinking control, and a better user experience.

📈 Performance Improvements

MiMo-VL-7B-RL-2508 demonstrates consistent improvements across both image and video benchmarks, achieving notable milestones of 70.6 on MMMU and 70.8 on VideoMME.

Full evaluation results can be found [below](#full-evaluation-results).

🤔 Thinking Control Feature

A new thinking control capability lets users turn off the model's reasoning mode by appending /no_think to the query:

  • Thinking mode (default): the full reasoning process is visible, with a 100% control success rate.
  • Non-thinking mode: direct responses without reasoning, with a 99.84% control success rate.

❤️ Enhanced User Experience

Our internal VLM Arena ratings show meaningful improvement in real-world performance:

  • Current model (MiMo-VL-7B-RL-2508): 1131.2 rating
  • Previous version (MiMo-VL-7B-RL): 1093.9 rating
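To put the 37.3-point gap in context, here is a minimal sketch that converts a rating difference into an expected head-to-head preference rate. It assumes the internal VLM Arena uses a standard Elo-style logistic scale with a 400-point divisor; the card does not state how its ratings are computed, so treat the result as a rough illustration only. The function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected probability that A is preferred over B under an Elo-style model.

    ASSUMPTION: the arena uses the conventional base-10 logistic curve with a
    400-point scale. The model card does not confirm this.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    new_rating = 1131.2  # MiMo-VL-7B-RL-2508, from the list above
    old_rating = 1093.9  # MiMo-VL-7B-RL, from the list above
    p = expected_win_rate(new_rating, old_rating)
    print(f"Expected preference rate for the 2508 model: {p:.1%}")
```

Under that assumption, the newer model would be preferred in roughly 55% of direct comparisons with the previous version.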

These updates deliver a more capable, flexible, and reliable vision-language model for both academic evaluation and practical applications.

😉 Case Study: What are the appealing features of this car?

![Xiaomi SU7](./su7.gif)

MiMo-VL-7B-RL-2508's Answer:

...
To determine the appealing features of the Xiaomi SU7, we analyze visual elements and context from the video:
1. Aesthetic Design
- The car boasts a sleek, aerodynamic exterior with smooth, modern lines.
- Distinctive lighting (sharp headlights and a continuous, glowing taillight strip) enhances its futuristic and sporty look.
- Vibrant color and polished finish make it visually striking.
2. Performance & Speed
- Multiple dynamic shots of the car driving at high speeds on diverse roads (desert, forest, city) emphasize agility and power.
- Close - ups of spinning wheels and the engine (with lightning-like effects) suggest advanced engineering and high-performance capabilities (likely electric or tech-driven).
3. Versatility
- The car is shown in varied environments (desert at dusk, forested roads, urban streets), highlighting its adaptability to different terrains and use cases.
4. Technological Innovation
- The engine close-up with electric-like lightning effects implies cutting-edge technology, positioning the car as a tech-forward vehicle (aligning with Xiaomi’s tech brand identity).
5. Brand Appeal
Prominent “Xiaomi SU7” branding leverages Xiaomi’s reputation in consumer tech, adding credibility and allure for tech-savvy audiences.

In summary, the Xiaomi SU7 appeals through its striking design, high-performance capabilities, versatility, technological innovation, and strong brand association.

Full Evaluation Results

Model Recommendation

Both versions of the MiMo-VL-7B-2508 model (MiMo-VL-7B-RL-2508 and MiMo-VL-7B-SFT-2508) are now open-sourced on Hugging Face.

Deployment Parameters

  • temperature=0.3, top_p=0.95
  • The system prompt is already set in chat_template.json and does not require additional configuration.
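As an illustration of how these settings might be carried into a request, the sketch below assembles a chat-completions style body. The serving setup (an OpenAI-compatible endpoint), the `build_request` helper, and the `max_tokens` value are assumptions made for the example and are not specified by the model card; only `temperature`, `top_p`, and the "no extra system prompt" rule come from this section.

```python
import json

# Recommended sampling settings from the Deployment Parameters list.
SAMPLING = {"temperature": 0.3, "top_p": 0.95}


def build_request(messages, model="XiaomiMiMo/MiMo-VL-7B-RL-2508", max_tokens=1024):
    """Assemble a request body for a hypothetical OpenAI-compatible server.

    ASSUMPTIONS: the endpoint format and max_tokens are illustrative only.
    No system message is added here, because the card states that the system
    prompt is already set in chat_template.json.
    """
    return {
        "model": model,
        "messages": list(messages),
        "max_tokens": max_tokens,
        **SAMPLING,
    }


if __name__ == "__main__":
    body = build_request([{"role": "user", "content": "Hello"}])
    print(json.dumps(body, indent=2))
```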

Thinking Control

Users can control the thinking mode by appending /no_think to queries:

  • Thinking mode query (default):

*"What is the answer to the question in the image?"*

  • Non-thinking mode query:

*"Identify the text in the image. /no_think"*

❗️Important: /no_think must be the very last part of the user message. No other user content, such as an image or video, may follow it.
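A small helper can make this rule harder to get wrong. The sketch below is not part of the MiMo-VL tooling; `set_thinking` is a hypothetical name, and it only handles the text part of a query. It guarantees that /no_think, when present, is the final token of the text.

```python
NO_THINK = "/no_think"


def set_thinking(query: str, thinking: bool = True) -> str:
    """Return the query text with /no_think as its last token when thinking is off.

    Hypothetical helper: it removes a trailing /no_think if one is already
    present, so calling it repeatedly never duplicates the switch.
    """
    text = query.strip()
    if text.endswith(NO_THINK):
        text = text[: -len(NO_THINK)].rstrip()
    return text if thinking else f"{text} {NO_THINK}"


if __name__ == "__main__":
    print(set_thinking("Identify the text in the image.", thinking=False))
    print(set_thinking("Identify the text in the image. /no_think", thinking=True))
```

Because the switch must also come after any image or video, the text part produced here should be the final item in the message content.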

Placing Visual Input

For prompts with a single image or video, always place the visual media before the text. For example:

✅ Good:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Describe the image. /no_think"},
        ],
    }
]

❌ Bad:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image. /no_think"},
            {"type": "image", "image": image_path},
        ],
    }
]
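The ordering rule can also be enforced programmatically. The following sketch builds a user message with the media first and checks existing messages for the "bad" ordering. Both function names are hypothetical, and the `{"type": "video", "video": ...}` shape is an assumption that mirrors the image entry shown above; the card only shows the image form.

```python
MEDIA_TYPES = {"image", "video"}


def build_user_message(text, media=None, media_type="image"):
    """Build a user message with the visual input placed before the text."""
    if media_type not in MEDIA_TYPES:
        raise ValueError(f"unsupported media type: {media_type}")
    content = []
    if media is not None:
        # ASSUMPTION: video entries use the same key pattern as image entries.
        content.append({"type": media_type, media_type: media})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def media_precedes_text(message):
    """Return True if no image or video part appears after a text part."""
    seen_text = False
    for part in message["content"]:
        if part["type"] == "text":
            seen_text = True
        elif part["type"] in MEDIA_TYPES and seen_text:
            return False
    return True


if __name__ == "__main__":
    good = build_user_message("Describe the image. /no_think", media="example.jpg")
    print(media_precedes_text(good))
```

Running a check like `media_precedes_text` before sending a request catches the reversed ordering, which would also violate the rule that nothing may follow /no_think.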

---

I. Introduction

In this report, we share our efforts to build a compact yet powerful VLM, MiMo-VL-7B. MiMo-VL-7B comprises (1) a native resolution ViT encoder that preserves fine-grained visual details, (2) an MLP projector for efficient cross-modal alignment, and (3) our MiMo-7B language model, specifically optimized for complex reasoning tasks.

The development of MiMo-VL-7B involves two sequential training processes: (1) A four-stage pre-training phase, which includes projector warmup, vision-language alignment, general multi-modal pre-training, and long-context Supervised Fine-Tuning (SFT). This phase yields the MiMo-VL-7B-SFT model. (2) A subsequent post-training phase, where we introduce Mixed On-policy Reinforcement Learning (MORL), a novel framework that seamlessly integrates diverse reward signals spanning perception accuracy, visual grounding precision, logical reasoning capabilities, and human/AI preferences. This phase yields the MiMo-VL-7B-RL model.

We open-source the MiMo-VL-7B series, including checkpoints of the SFT and RL models. We believe this report, along with the models, will provide valuable insights for developing powerful reasoning VLMs that benefit the larger community.

🛤️ During this journey, we find

  • Incorporating high-quality, broad-coverage reasoning data from the pre-training stage is crucial for enhancing model performance
      • We curate high-quality reasoning data by identifying diverse queries, employing large reasoning models to regenerate responses with long CoT, and applying rejection sampling to ensure quality.
      • Rather than treating this as supplementary fine-tuning data, we incorporate substantial volumes of this synthetic reasoning data directly into the later pre-training stages, where extended training yields continued performance improvements without saturation.
  • **Mixed…
