What does this repo signal mean?

StepFun published stepfun-ai/Step3-VL-10B. A repository signal like this one exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo stepfun-ai/Step3-VL-10B · new vision-language model, moderate stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

StepFun Repo: stepfun-ai/Step3-VL-10B

Captured source

GitHub: github.com/stepfun-ai/Step3-VL-10B

stepfun-ai/Step3-VL-10B repository metadata

published Jan 13, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

stepfun-ai/Step3-VL-10B

Description: Step3-VL-10B: A compact yet frontier multimodal model achieving SOTA performance at the 10B scale, matching open-source models 10-20x its size.

License: Apache-2.0

Stars: 407

Forks: 30

Open issues: 19

Created: 2026-01-13T09:13:12Z

Pushed: 2026-01-21T13:42:15Z

Default branch: main

Fork: no

Archived: no

README:

🚀 Introduction

STEP3-VL-10B is a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. Despite its compact 10B parameter footprint, STEP3-VL-10B excels in visual perception, complex reasoning, and human-centric alignment. It consistently outperforms models under the 10B scale and rivals or surpasses significantly larger open-weights models (10×–20× its size), such as GLM-4.6V (106B-A12B), Qwen3-VL-Thinking (235B-A22B), and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL.

The success of STEP3-VL-10B is driven by two key strategic designs:

1. Unified Pre-training on High-Quality Multimodal Corpus: A single-stage, fully unfrozen training strategy on a 1.2T token multimodal corpus, focusing on two foundational capabilities: reasoning (e.g., general knowledge and education-centric tasks) and perception (e.g., grounding, counting, OCR, and GUI interactions). By jointly optimizing the Perception Encoder and the Qwen3-8B decoder, STEP3-VL-10B establishes intrinsic vision-language synergy.
2. Scaled Multimodal Reinforcement Learning and Parallel Reasoning: Frontier capabilities are unlocked through a rigorous post-training pipeline comprising two-stage supervised finetuning (SFT) and over 1,400 iterations of RL with both verifiable rewards (RLVR) and human feedback (RLHF). Beyond sequential reasoning, we adopt Parallel Coordinated Reasoning (PaCoRe), which allocates test-time compute to aggregate evidence from parallel visual exploration.

📥 Model Zoo

| Model Name | Type | Hugging Face | ModelScope |
|:-----------|:-----|:------------:|:----------:|
| STEP3-VL-10B-Base | Base | 🤗 Download | 🤖 Download |
| STEP3-VL-10B | Chat | 🤗 Download | 🤖 Download |

📊 Performance

STEP3-VL-10B delivers best-in-class performance across major multimodal benchmarks, establishing a new performance standard for compact models. The results demonstrate that STEP3-VL-10B is the most powerful open-source model in the 10B parameter class.

Comparison with Larger Models (10×–20× Larger)

| Benchmark | STEP3-VL-10B (SeRe) | STEP3-VL-10B (PaCoRe) | GLM-4.6V (106B-A12B) | Qwen3-VL (235B-A22B) | Gemini-2.5-Pro | Seed-1.5-VL |
|:----------|:-------------------:|:---------------------:|:--------------------:|:--------------------:|:--------------:|:-----------:|
| MMMU | 78.11 | 80.11 | 75.20 | 78.70 | 83.89 | 79.11 |
| MathVista | 83.97 | 85.50 | 83.51 | 85.10 | 83.88 | 85.60 |
| MathVision | 70.81 | 75.95 | 63.50 | 72.10 | 73.30 | 68.70 |
| MMBench (EN) | 92.05 | 92.38 | 92.75 | 92.70 | 93.19 | 92.11 |
| MMStar | 77.48 | 77.64 | 75.30 | 76.80 | 79.18 | 77.91 |
| OCRBench | 86.75 | 89.00 | 86.20 | 87.30 | 85.90 | 85.20 |
| AIME 2025 | 87.66 | 94.43 | 71.88 | 83.59 | 83.96 | 64.06 |
| HMMT 2025 | 78.18 | 92.14 | 57.29 | 67.71 | 65.68 | 51.30 |
| LiveCodeBench | 75.77 | 76.43 | 48.71 | 69.45 | 72.01 | 57.10 |
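To make the difference between the two STEP3-VL-10B columns easier to read, the short sketch below computes the per-benchmark gain of PaCoRe over SeRe. The scores are copied from the table above; the function name `pacore_gains` and the dictionary layout are illustrative choices, not anything the repository provides.

```python
# Scores copied from the comparison table: (SeRe, PaCoRe) for STEP3-VL-10B.
scores = {
    "MMMU": (78.11, 80.11),
    "MathVista": (83.97, 85.50),
    "MathVision": (70.81, 75.95),
    "MMBench (EN)": (92.05, 92.38),
    "MMStar": (77.48, 77.64),
    "OCRBench": (86.75, 89.00),
    "AIME 2025": (87.66, 94.43),
    "HMMT 2025": (78.18, 92.14),
    "LiveCodeBench": (75.77, 76.43),
}


def pacore_gains(table):
    """Return (benchmark, gain) pairs sorted by PaCoRe gain, largest first."""
    gains = [
        (name, round(pacore - sere, 2))
        for name, (sere, pacore) in table.items()
    ]
    return sorted(gains, key=lambda item: item[1], reverse=True)


for name, gain in pacore_gains(scores):
    print(f"{name:<14} +{gain:.2f}")
```

On these numbers the largest gains fall on the competition-math benchmarks (HMMT 2025 at +13.96, AIME 2025 at +6.77), while MMStar and MMBench (EN) move by less than half a point.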

> Note on Inference Modes:
>
> SeRe (Sequential Reasoning): The standard inference mode using sequential generation (Chain-of-Thought) with a max length of 64K tokens.
>
> PaCoRe (Parallel Coordinated Reasoning): An advanced mode that scales test-time compute. It aggregates evidence from 16 parallel rollouts to synthesize a final answer, utilizing a max context length of 128K tokens.
>
> *Unless otherwise stated, scores below refer to the standard SeRe mode. Higher scores achieved via PaCoRe are explicitly marked.*
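The general shape of the parallel mode described in the note (several independent rollouts, then one aggregation step) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PaCoRe implementation: `rollout` and `fake_rollout` are hypothetical stand-ins for a model call, and the aggregator here is a plain majority vote, whereas the note says PaCoRe synthesizes a final answer from the rollouts. Only the rollout count of 16 comes from the source.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

NUM_ROLLOUTS = 16  # rollout count stated in the README note


def parallel_answer(
    prompt: str,
    rollout: Callable[[str, int], str],
    synthesize: Callable[[List[str]], str],
    num_rollouts: int = NUM_ROLLOUTS,
) -> str:
    """Run independent rollouts concurrently, then merge them into one answer."""
    with ThreadPoolExecutor(max_workers=num_rollouts) as pool:
        # Each rollout receives its own seed so the samples can differ.
        candidates = list(
            pool.map(lambda seed: rollout(prompt, seed), range(num_rollouts))
        )
    return synthesize(candidates)


def majority_vote(candidates: List[str]) -> str:
    """Stand-in aggregator (assumption): return the most frequent answer."""
    return Counter(candidates).most_common(1)[0][0]


def fake_rollout(prompt: str, seed: int) -> str:
    """Hypothetical stub replacing a model call; 3 of every 4 seeds agree."""
    return "42" if seed % 4 else "41"


print(parallel_answer("What is 6 * 7?", fake_rollout, majority_vote))
```

Passing the aggregator in as a function keeps the sketch honest about what is unknown: a model-based synthesis step could replace `majority_vote` without changing the rollout loop.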

Comparison with Open-Source Models (7B–10B)

| Category | Benchmark | STEP3-VL-10B | GLM-4.6V-Flash (9B) | Qwen3-VL-Thinking (8B) | InternVL-3.5 (8B) | MiMo-VL-RL-2508 (7B) |
|:---------|:----------|:------------:|:-------------------:|:----------------------:|:-----------------:|:--------------------:|
| STEM Reasoning | MMMU | 78.11 | 71.17 | 73.53 | 71.69 | 71.14 |
| | MathVision | 70.81 | 54.05 | 59.60 | 52.05 | 59.65 |
| | MathVista | 83.97 | 82.85 | 78.50 | 76.78 | 79.86 |
| | PhyX | 59.45 | 52.28 | 57.67 | 50.51 | 56.00 |
| Recognition | MMBench (EN) | 92.05 | 91.04 | 90.55 | 88.20 | 89.91 |
| | MMStar | 77.48 | 74.26 | 73.58 | 69.83 | 72.93 |
| | ReMI | 67.29 | 60.75 | 57.17 | 52.65 | 63.13 |
| OCR & Document | OCRBench | 86.75 | 85.97 | 82.85 | 83.70 | 85.40 |
| | AI2D | 89.35 | 88.93 | 83.32 | 82.34 | 84.96 |
| GUI Grounding | ScreenSpot-V2 | 92.61 | 92.14 | 93.60 | 84.02 | 90.82 |
| | ScreenSpot-Pro | 51.55 | 45.68 | 46.60 | 15.39 | 34.84 |
| | OSWorld-G | 59.02 | 54.71 | 56.70 | 31.91 | 50.54 |
| Spatial | BLINK | 66.79 | 64.90 | 62.78 | 55.40 | 62.57 |
| | All-Angles-Bench | 57.21 | 53.24 | 45.88 | 45.29 | 51.62 |
| Code | HumanEval-V | 66.05 | 29.26 | 26.94 | 24.31 | 31.96 |
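One way to read the table above is to compare STEP3-VL-10B against the best of the four baselines on each benchmark. The sketch below does that arithmetic. The scores are copied from the table, with the baselines in column order (GLM-4.6V-Flash, Qwen3-VL-Thinking, InternVL-3.5, MiMo-VL-RL-2508); the names `rows` and `margins` are illustrative.

```python
# Each entry: (STEP3-VL-10B score, [scores of the four 7B-10B baselines]).
rows = {
    "MMMU": (78.11, [71.17, 73.53, 71.69, 71.14]),
    "MathVision": (70.81, [54.05, 59.60, 52.05, 59.65]),
    "MathVista": (83.97, [82.85, 78.50, 76.78, 79.86]),
    "PhyX": (59.45, [52.28, 57.67, 50.51, 56.00]),
    "MMBench (EN)": (92.05, [91.04, 90.55, 88.20, 89.91]),
    "MMStar": (77.48, [74.26, 73.58, 69.83, 72.93]),
    "ReMI": (67.29, [60.75, 57.17, 52.65, 63.13]),
    "OCRBench": (86.75, [85.97, 82.85, 83.70, 85.40]),
    "AI2D": (89.35, [88.93, 83.32, 82.34, 84.96]),
    "ScreenSpot-V2": (92.61, [92.14, 93.60, 84.02, 90.82]),
    "ScreenSpot-Pro": (51.55, [45.68, 46.60, 15.39, 34.84]),
    "OSWorld-G": (59.02, [54.71, 56.70, 31.91, 50.54]),
    "BLINK": (66.79, [64.90, 62.78, 55.40, 62.57]),
    "All-Angles-Bench": (57.21, [53.24, 45.88, 45.29, 51.62]),
    "HumanEval-V": (66.05, [29.26, 26.94, 24.31, 31.96]),
}


def margins(table):
    """Map each benchmark to STEP3-VL-10B's margin over the best baseline."""
    return {
        name: round(ours - max(others), 2)
        for name, (ours, others) in table.items()
    }


lead = margins(rows)
trailing = [name for name, margin in lead.items() if margin < 0]
print("Benchmarks where a baseline scores higher:", trailing)
print("Largest lead:", max(lead, key=lead.get), lead[max(lead, key=lead.get)])
```

On these figures STEP3-VL-10B leads on 14 of the 15 benchmarks. The exception is ScreenSpot-V2, where Qwen3-VL-Thinking (8B) is 0.99 points ahead, and the widest lead is on HumanEval-V (34.09 points).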

Key Capabilities

STEM Reasoning: Achieves 94.43% on AIME 2025 and 75.95% on MathVision (with PaCoRe), demonstrating exceptional complex reasoning capabilities that outperform models 10×–20× larger.
Visual Perception: Records 92.05% on MMBench (EN) and 80.11% on MMMU (the latter with PaCoRe; 78.11% in SeRe mode), establishing strong general visual understanding and multimodal reasoning.
GUI & OCR: Leads the 7B–10B comparison on ScreenSpot-Pro (51.55%) and OCRBench (86.75%) and comes within one point of the best score on ScreenSpot-V2 (92.61%, against 93.60% for Qwen3-VL-Thinking 8B); the model is optimized for agentic and document understanding tasks.
Spatial Understanding: Demonstrates emergent spatial awareness with 66.79% on BLINK and 57.21% on All-Angles-Bench, establishing strong potential for embodied intelligence applications.

🏗️ Architecture & Training

Architecture

Visual Encoder: PE-lang (Language-Optimized Perception Encoder), 1.8B parameters.
Decoder: Qwen3-8B.
Projector: Two consecutive stride-2 layers...
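The projector description is truncated in this excerpt, so the following is only a rough sketch of what two consecutive stride-2 layers would imply for the visual token count. It assumes each layer halves both spatial dimensions of the patch grid and that the grid divides evenly. The function name and the 64×64 example grid are hypothetical; the source states neither the layer type nor the grid size.

```python
def tokens_after_projector(grid_h: int, grid_w: int,
                           num_layers: int = 2, stride: int = 2) -> int:
    """Count visual tokens left after repeated strided downsampling.

    Assumption: every layer divides both grid dimensions by `stride`
    with no padding, so each dimension must be divisible by
    stride ** num_layers.
    """
    factor = stride ** num_layers
    if grid_h % factor or grid_w % factor:
        raise ValueError("grid dimensions must be divisible by stride ** num_layers")
    return (grid_h // factor) * (grid_w // factor)


# Hypothetical 64x64 patch grid: 4096 patches in, 256 tokens out (16x fewer).
patches_in = 64 * 64
tokens_out = tokens_after_projector(64, 64)
print(patches_in, tokens_out, patches_in // tokens_out)
```

Under these assumptions the decoder sees 16 times fewer visual tokens than the encoder produces patches, which is the usual motivation for a strided projector between a vision encoder and a language decoder.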

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

New vision-language model, moderate stars.