XiaomiMiMo/MiMo-VL-7B-SFT-2508
Captured source
source ↗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MiMo-VL Technical Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔥🔥🔥MiMo-VL 2508 Updates
We're excited to announce improvements to our MiMo-VL models (MiMo-VL-7B-RL-2508 and MiMo-VL-7B-SFT-2508), featuring enhanced performance across multiple benchmarks, improved thinking control capabilities, and a better user experience.
📈 Performance Improvements
MiMo-VL-7B-RL-2508 demonstrates consistent improvements across both image and video benchmarks, achieving notable milestones of 70.6 on MMMU and 70.8 on VideoMME.
Full evaluation results can be found [below](#full-evaluation-results).
🤔 Thinking Control Feature
A new thinking control capability lets users turn off the model's reasoning mode by appending /no_think to the query:
- Thinking mode (default behavior): the full reasoning process is visible, with a 100% control success rate.
- Non-thinking mode: direct responses without reasoning, with a 99.84% control success rate.
❤️ Enhanced User Experience
Our internal VLM Arena ratings show meaningful improvement in real-world performance:
- Current model (MiMo-VL-7B-RL-2508): 1131.2 rating
- Previous version (MiMo-VL-7B-RL): 1093.9 rating
These updates deliver a more capable, flexible, and reliable vision-language model for both academic evaluation and practical applications.
😉 Case Study: What are the appealing features of this car?

MiMo-VL-7B-RL-2508's Answer:
... To determine the appealing features of the Xiaomi SU7, we analyze visual elements and context from the video:
1. Aesthetic Design
- The car boasts a sleek, aerodynamic exterior with smooth, modern lines.
- Distinctive lighting (sharp headlights and a continuous, glowing taillight strip) enhances its futuristic and sporty look.
- Vibrant color and polished finish make it visually striking.
2. Performance & Speed
- Multiple dynamic shots of the car driving at high speeds on diverse roads (desert, forest, city) emphasize agility and power.
- Close-ups of spinning wheels and the engine (with lightning-like effects) suggest advanced engineering and high-performance capabilities (likely electric or tech-driven).
3. Versatility
- The car is shown in varied environments (desert at dusk, forested roads, urban streets), highlighting its adaptability to different terrains and use cases.
4. Technological Innovation
- The engine close-up with electric-like lightning effects implies cutting-edge technology, positioning the car as a tech-forward vehicle (aligning with Xiaomi’s tech brand identity).
5. Brand Appeal
- Prominent “Xiaomi SU7” branding leverages Xiaomi’s reputation in consumer tech, adding credibility and allure for tech-savvy audiences.
In summary, the Xiaomi SU7 appeals through its striking design, high-performance capabilities, versatility, technological innovation, and strong brand association.
Full Evaluation Results
Model Recommendation
Both versions of the MiMo-VL-7B-2508 model are now open-sourced on Hugging Face:
- 🤗 **MiMo-VL-7B-RL-2508**
  - Recommended for most users.
- 🤗 **MiMo-VL-7B-SFT-2508**
  - Intended as a starting point for users who want to run their own SFT and RL. Compared to the previous SFT version, this model demonstrates higher RL stability.
Deployment Parameters
- temperature=0.3, top_p=0.95
- The system prompt is already set in chat_template.json and does not require additional configuration.
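As a rough illustration of these settings, the sketch below builds a request body with the recommended sampling values. It assumes the model is served behind an OpenAI-compatible chat-completions endpoint, which the source does not specify; `build_request` and `MODEL_ID` are hypothetical names used only for this example. No system message is added, since the chat template already supplies one.

```python
import json

# Assumption: the serving stack accepts OpenAI-style chat-completions payloads.
MODEL_ID = "XiaomiMiMo/MiMo-VL-7B-SFT-2508"


def build_request(user_text: str) -> dict:
    """Return a request body using the recommended sampling parameters."""
    return {
        "model": MODEL_ID,
        "temperature": 0.3,  # recommended deployment value
        "top_p": 0.95,       # recommended deployment value
        # No system message: chat_template.json already sets the system prompt.
        "messages": [{"role": "user", "content": user_text}],
    }


payload = build_request("Identify the text in the image. /no_think")
print(json.dumps(payload, indent=2))
```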
Thinking Control
Users can control the thinking mode by appending /no_think to queries:
- Thinking mode query (default):
*"What is the answer to the question in the image?"*
- Non-thinking mode query:
*"Identify the text in the image. /no_think"*
❗️Important: The /no_think command must be the very last part of the user message. No other user content, such as an image or video, may follow /no_think.
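The rule above can be checked mechanically before a request is sent. The following is a minimal sketch, assuming messages use the list-of-dicts content format shown in this card; `append_no_think` and `no_think_is_last` are hypothetical helper names, not part of any MiMo-VL API.

```python
NO_THINK = "/no_think"


def append_no_think(query: str) -> str:
    """Append /no_think to a query unless it already ends with it."""
    query = query.rstrip()
    if query.endswith(NO_THINK):
        return query
    return f"{query} {NO_THINK}"


def no_think_is_last(content: list[dict]) -> bool:
    """Check that /no_think, if used, ends the final item of the message."""
    uses_no_think = any(
        item.get("type") == "text" and NO_THINK in item.get("text", "")
        for item in content
    )
    if not uses_no_think:
        return True  # thinking mode: nothing to check
    last = content[-1]
    return (
        last.get("type") == "text"
        and last.get("text", "").rstrip().endswith(NO_THINK)
    )


good = [
    {"type": "image", "image": "example.png"},
    {"type": "text", "text": append_no_think("Describe the image.")},
]
bad = list(reversed(good))  # the image follows /no_think

print(no_think_is_last(good), no_think_is_last(bad))
```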
Placing Visual Input
For prompts with a single image or video, always place the visual media before the text. For example:
✅ Good:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Describe the image. /no_think"},
        ],
    }
]
❌ Bad:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image. /no_think"},
            {"type": "image", "image": image_path},
        ],
    }
]
---
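For callers that assemble content lists programmatically, the placement rule for visual input can be enforced with a small reordering step. This is a sketch only: `media_first` is a hypothetical helper, and it targets the single-image or single-video case described above; whether it suits interleaved multi-image prompts is not covered by the source.

```python
def media_first(content: list[dict]) -> list[dict]:
    """Return the content items with all non-text items ahead of text items.

    Relative order inside each group is preserved.
    """
    media = [item for item in content if item.get("type") != "text"]
    text = [item for item in content if item.get("type") == "text"]
    return media + text


image_path = "example.png"  # placeholder path for illustration

# Text placed before the image, as in the "Bad" example.
content = [
    {"type": "text", "text": "Describe the image. /no_think"},
    {"type": "image", "image": image_path},
]

messages = [{"role": "user", "content": media_first(content)}]
print(messages[0]["content"][0]["type"])
```

Because the text item ends up last, a trailing /no_think stays at the very end of the user message.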
I. Introduction
In this report, we share our efforts to build a compact yet powerful VLM, MiMo-VL-7B. MiMo-VL-7B comprises (1) a native resolution ViT encoder that preserves fine-grained visual details, (2) an MLP projector for efficient cross-modal alignment, and (3) our MiMo-7B language model, specifically optimized for complex reasoning tasks.
The development of MiMo-VL-7B involves two sequential training processes: (1) A four-stage pre-training phase, which includes projector warmup, vision-language alignment, general multi-modal pre-training, and long-context Supervised Fine-Tuning (SFT). This phase yields the MiMo-VL-7B-SFT model. (2) A subsequent post-training phase, where we introduce Mixed On-policy Reinforcement Learning (MORL), a novel framework that seamlessly integrates diverse reward signals spanning perception accuracy, visual grounding precision, logical reasoning capabilities, and human/AI preferences. This phase yields the MiMo-VL-7B-RL model.
We open-source the MiMo-VL-7B series, including checkpoints of the SFT and RL models. We believe this report, along with the models, will provide valuable insights for developing powerful reasoning VLMs that benefit the larger community.
🛤️ During this journey, we find:
- Incorporating high-quality, broad-coverage reasoning data from the pre-training stage is crucial for enhancing model performance.
  - We curate high-quality reasoning data by identifying diverse queries, employing large reasoning models to regenerate responses with long CoT, and applying rejection sampling to ensure quality.
  - Rather than treating this as supplementary fine-tuning data, we incorporate substantial volumes of this synthetic reasoning data directly into the later pre-training stages, where extended training yields continued performance improvements without saturation.
- **Mixed…