stepfun-ai/Step3-VL-10B
Captured source
Description: Step3-VL-10B: A compact yet frontier multimodal model achieving SOTA performance at the 10B scale, matching open-source models 10-20x its size.
License: Apache-2.0
Stars: 407
Forks: 30
Open issues: 19
Created: 2026-01-13T09:13:12Z
Pushed: 2026-01-21T13:42:15Z
Default branch: main
Fork: no
Archived: no
README:
🚀 Introduction
STEP3-VL-10B is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its compact 10B parameter footprint, STEP3-VL-10B excels in visual perception, complex reasoning, and human-centric alignment. It consistently outperforms models under the 10B scale and rivals or surpasses significantly larger open-weights models (10×–20× its size), such as GLM-4.6V (106B-A12B) and Qwen3-VL-Thinking (235B-A22B), as well as top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL.
The success of STEP3-VL-10B is driven by two key strategic designs:
1. Unified Pre-training on High-Quality Multimodal Corpus: A single-stage, fully unfrozen training strategy on a 1.2T token multimodal corpus, focusing on two foundational capabilities: reasoning (e.g., general knowledge and education-centric tasks) and perception (e.g., grounding, counting, OCR, and GUI interactions). By jointly optimizing the Perception Encoder and the Qwen3-8B decoder, STEP3-VL-10B establishes intrinsic vision-language synergy.
2. Scaled Multimodal Reinforcement Learning and Parallel Reasoning: Frontier capabilities are unlocked through a rigorous post-training pipeline comprising two-stage supervised finetuning (SFT) and over 1,400 iterations of RL with both verifiable rewards (RLVR) and human feedback (RLHF). Beyond sequential reasoning, we adopt Parallel Coordinated Reasoning (PaCoRe), which allocates test-time compute to aggregate evidence from parallel visual exploration.
📥 Model Zoo
| Model Name | Type | Hugging Face | ModelScope |
|:-----------|:-----|:------------:|:----------:|
| STEP3-VL-10B-Base | Base | 🤗 Download | 🤖 Download |
| STEP3-VL-10B | Chat | 🤗 Download | 🤖 Download |
📊 Performance
STEP3-VL-10B delivers best-in-class performance across major multimodal benchmarks, establishing a new performance standard for compact models. The results demonstrate that STEP3-VL-10B is the most powerful open-source model in the 10B parameter class.
Comparison with Larger Models (10×–20× Larger)
| Benchmark | STEP3-VL-10B (SeRe) | STEP3-VL-10B (PaCoRe) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5-Pro | Seed-1.5-VL |
|:----------|:-------------------:|:---------------------:|:--------------------:|:--------------------:|:--------------:|:-----------:|
| MMMU | 78.11 | 80.11 | 75.20 | 78.70 | 83.89 | 79.11 |
| MathVista | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | 85.60 |
| MathVision | 70.81 | 75.95 | 63.50 | 72.10 | 73.30 | 68.70 |
| MMBench (EN) | 92.05 | 92.38 | 92.75 | 92.70 | 93.19 | 92.11 |
| MMStar | 77.48 | 77.64 | 75.30 | 76.80 | 79.18 | 77.91 |
| OCRBench | 86.75 | 89.00 | 86.20 | 87.30 | 85.90 | 85.20 |
| AIME 2025 | 87.66 | 94.43 | 71.88 | 83.59 | 83.96 | 64.06 |
| HMMT 2025 | 78.18 | 92.14 | 57.29 | 67.71 | 65.68 | 51.30 |
| LiveCodeBench | 75.77 | 76.43 | 48.71 | 69.45 | 72.01 | 57.10 |
> Note on Inference Modes:
>
> SeRe (Sequential Reasoning): The standard inference mode using sequential generation (Chain-of-Thought) with a max length of 64K tokens.
>
> PaCoRe (Parallel Coordinated Reasoning): An advanced mode that scales test-time compute. It aggregates evidence from 16 parallel rollouts to synthesize a final answer, utilizing a max context length of 128K tokens.
>
> *Unless otherwise stated, scores below refer to the standard SeRe mode. Higher scores achieved via PaCoRe are explicitly marked.*
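The note above describes PaCoRe only at a high level: 16 parallel rollouts whose evidence is aggregated into one final answer. The following is a minimal sketch of that control flow, not the project's implementation. The `generate` callable, the prompt layout, and the `pacore_answer` helper are all hypothetical names introduced for illustration; the source does not specify how the synthesis step is prompted.

```python
from typing import Callable, List


def pacore_answer(
    question: str,
    generate: Callable[[str], str],  # hypothetical: any prompt -> text function
    num_rollouts: int = 16,          # rollout count taken from the note above
) -> str:
    """Sample several rollouts, then ask the model to synthesize one answer."""
    # Step 1: independent rollouts. They run sequentially here for simplicity;
    # a real serving stack would presumably batch them in parallel.
    rollouts: List[str] = [generate(question) for _ in range(num_rollouts)]

    # Step 2: pack every rollout into one synthesis prompt (assumed format).
    evidence = "\n\n".join(
        f"[Rollout {i + 1}]\n{text}" for i, text in enumerate(rollouts)
    )
    synthesis_prompt = (
        f"{question}\n\n"
        f"Candidate reasoning from {num_rollouts} attempts:\n\n"
        f"{evidence}\n\n"
        "Synthesize a single final answer."
    )

    # Step 3: one more generation pass produces the final answer.
    return generate(synthesis_prompt)


# Stand-in model so the sketch runs without any weights.
calls: List[str] = []


def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return f"draft {len(calls)}"


final = pacore_answer("What is 2 + 2?", fake_generate)
print(len(calls), final)
```

With the stand-in model, the sketch makes 17 generation calls: 16 rollouts plus one synthesis pass. Packing all 16 rollouts into a single synthesis prompt is one plausible reason the PaCoRe context limit (128K) is larger than the SeRe limit (64K), though the source does not state this.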
Comparison with Open-Source Models (7B–10B)
| Category | Benchmark | STEP3-VL-10B | GLM-4.6V-Flash (9B) | Qwen3-VL-Thinking (8B) | InternVL-3.5 (8B) | MiMo-VL-RL-2508 (7B) |
|:---------|:----------|:------------:|:-------------------:|:----------------------:|:-----------------:|:--------------------:|
| STEM Reasoning | MMMU | 78.11 | 71.17 | 73.53 | 71.69 | 71.14 |
| | MathVision | 70.81 | 54.05 | 59.60 | 52.05 | 59.65 |
| | MathVista | 83.97 | 82.85 | 78.50 | 76.78 | 79.86 |
| | PhyX | 59.45 | 52.28 | 57.67 | 50.51 | 56.00 |
| Recognition | MMBench (EN) | 92.05 | 91.04 | 90.55 | 88.20 | 89.91 |
| | MMStar | 77.48 | 74.26 | 73.58 | 69.83 | 72.93 |
| | ReMI | 67.29 | 60.75 | 57.17 | 52.65 | 63.13 |
| OCR & Document | OCRBench | 86.75 | 85.97 | 82.85 | 83.70 | 85.40 |
| | AI2D | 89.35 | 88.93 | 83.32 | 82.34 | 84.96 |
| GUI Grounding | ScreenSpot-V2 | 92.61 | 92.14 | 93.60 | 84.02 | 90.82 |
| | ScreenSpot-Pro | 51.55 | 45.68 | 46.60 | 15.39 | 34.84 |
| | OSWorld-G | 59.02 | 54.71 | 56.70 | 31.91 | 50.54 |
| Spatial | BLINK | 66.79 | 64.90 | 62.78 | 55.40 | 62.57 |
| | All-Angles-Bench | 57.21 | 53.24 | 45.88 | 45.29 | 51.62 |
| Code | HumanEval-V | 66.05 | 29.26 | 26.94 | 24.31 | 31.96 |
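To read the 7B–10B table at a glance, it can help to compute the margin between STEP3-VL-10B and the strongest baseline on each benchmark. The sketch below copies the scores from the table above; the helper name `margin_over_best_baseline` is introduced here for illustration only.

```python
# Column order matches the 7B-10B comparison table above.
MODELS = [
    "STEP3-VL-10B",
    "GLM-4.6V-Flash (9B)",
    "Qwen3-VL-Thinking (8B)",
    "InternVL-3.5 (8B)",
    "MiMo-VL-RL-2508 (7B)",
]

# Scores copied verbatim from the table (STEP3-VL-10B in SeRe mode).
SCORES = {
    "MMMU": [78.11, 71.17, 73.53, 71.69, 71.14],
    "MathVision": [70.81, 54.05, 59.60, 52.05, 59.65],
    "MathVista": [83.97, 82.85, 78.50, 76.78, 79.86],
    "PhyX": [59.45, 52.28, 57.67, 50.51, 56.00],
    "MMBench (EN)": [92.05, 91.04, 90.55, 88.20, 89.91],
    "MMStar": [77.48, 74.26, 73.58, 69.83, 72.93],
    "ReMI": [67.29, 60.75, 57.17, 52.65, 63.13],
    "OCRBench": [86.75, 85.97, 82.85, 83.70, 85.40],
    "AI2D": [89.35, 88.93, 83.32, 82.34, 84.96],
    "ScreenSpot-V2": [92.61, 92.14, 93.60, 84.02, 90.82],
    "ScreenSpot-Pro": [51.55, 45.68, 46.60, 15.39, 34.84],
    "OSWorld-G": [59.02, 54.71, 56.70, 31.91, 50.54],
    "BLINK": [66.79, 64.90, 62.78, 55.40, 62.57],
    "All-Angles-Bench": [57.21, 53.24, 45.88, 45.29, 51.62],
    "HumanEval-V": [66.05, 29.26, 26.94, 24.31, 31.96],
}


def margin_over_best_baseline(row):
    """Return (STEP3-VL-10B score minus best baseline score, baseline name)."""
    step_score, baselines = row[0], row[1:]
    best = max(baselines)
    rival = MODELS[1 + baselines.index(best)]
    return round(step_score - best, 2), rival


for benchmark, row in SCORES.items():
    margin, rival = margin_over_best_baseline(row)
    print(f"{benchmark:18s} {margin:+6.2f} vs {rival}")
```

On these numbers, STEP3-VL-10B leads on 14 of the 15 benchmarks. The exception is ScreenSpot-V2, where it trails Qwen3-VL-Thinking (8B) by 0.99 points, and the widest lead is on HumanEval-V (+34.09 over MiMo-VL-RL-2508).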
Key Capabilities
- STEM Reasoning: Achieves 94.43% on AIME 2025 and 75.95% on MathVision (with PaCoRe), demonstrating exceptional complex reasoning capabilities that outperform models 10×–20× larger.
- Visual Perception: Records 92.05% on MMBench (EN) and 80.11% on MMMU (with PaCoRe; 78.11% in SeRe mode), establishing strong general visual understanding and multimodal reasoning.
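- GUI & OCR: Scores 92.61% on ScreenSpot-V2, 51.55% on ScreenSpot-Pro, and 86.75% on OCRBench. It leads the 7B–10B comparison on ScreenSpot-Pro and OCRBench and trails only Qwen3-VL-Thinking (8B) at 93.60% on ScreenSpot-V2. The model is optimized for agentic and document understanding tasks.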
- Spatial Understanding: Demonstrates emergent spatial awareness with 66.79% on BLINK and 57.21% on All-Angles-Bench, establishing strong potential for embodied intelligence applications.
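Several of the figures cited above are PaCoRe scores. To see how much the parallel mode adds over the standard SeRe mode, the two STEP3-VL-10B columns of the larger-model comparison table can be subtracted directly. This is a small sketch using only numbers from that table; `pacore_gains` is an illustrative helper name.

```python
# (SeRe, PaCoRe) score pairs for STEP3-VL-10B, copied from the
# "Comparison with Larger Models" table.
SERE_VS_PACORE = {
    "MMMU": (78.11, 80.11),
    "MathVista": (83.97, 85.50),
    "MathVision": (70.81, 75.95),
    "MMBench (EN)": (92.05, 92.38),
    "MMStar": (77.48, 77.64),
    "OCRBench": (86.75, 89.00),
    "AIME 2025": (87.66, 94.43),
    "HMMT 2025": (78.18, 92.14),
    "LiveCodeBench": (75.77, 76.43),
}


def pacore_gains(scores):
    """Map each benchmark to its PaCoRe-minus-SeRe gain in points."""
    return {
        name: round(pacore - sere, 2)
        for name, (sere, pacore) in scores.items()
    }


gains = pacore_gains(SERE_VS_PACORE)

# Print benchmarks from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} +{gain:.2f}")
```

PaCoRe improves every benchmark in the table. The gains are largest on the competition-math benchmarks (HMMT 2025 at +13.96, AIME 2025 at +6.77) and smallest on MMStar (+0.16).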
🏗️ Architecture & Training
Architecture
- Visual Encoder: PE-lang (Language-Optimized Perception Encoder), 1.8B parameters.
- Decoder: Qwen3-8B.
- Projector: Two consecutive stride-2 layers…
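The projector line is truncated in this excerpt, so the following is only an arithmetic sketch of what two consecutive stride-2 layers imply for the visual token count. It assumes each layer halves both the height and the width of the encoder's patch grid with no padding. The 64×64 grid is a hypothetical input size, and `tokens_after_projector` is an illustrative helper, not code from the repository.

```python
def tokens_after_projector(grid_h: int, grid_w: int,
                           num_layers: int = 2, stride: int = 2) -> int:
    """Count visual tokens left after stacked strided downsampling layers.

    Assumption: every layer divides both spatial dimensions by `stride`
    (floor division, no padding).
    """
    for _ in range(num_layers):
        grid_h //= stride
        grid_w //= stride
    return grid_h * grid_w


# Hypothetical 64x64 patch grid from the visual encoder.
before = 64 * 64
after = tokens_after_projector(64, 64)
print(before, after, before // after)
```

Under these assumptions, two stride-2 layers shrink each side of the grid by 4× and the token count by 16×, which reduces the number of visual tokens the decoder has to process.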
Excerpt shown — open the source for the full document.
Notability
notability 6.0/10: New vision-language model, moderate stars.