What does this model signal mean?

StepFun published stepfun-ai/Step-3.5-Flash-Base. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license apache-2.0 · 168 HF downloads · Fast base multimodal model by Chinese AI startup StepFun. onlylabs links this event to 1 captured evidence page and 6 related model signals.

StepFun Model: stepfun-ai/Step-3.5-Flash-Base

Captured source

Hugging Face: huggingface.co/stepfun-ai/Step-3.5-Flash-Base

stepfun-ai/Step-3.5-Flash-Base model card

published Mar 2, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task text-generation · license apache-2.0 · library transformers · params 198B · downloads 168 · likes 85

Step 3.5 Flash Base

1. Introduction

Step 3.5 Flash is our most capable open-source foundation model, engineered to deliver frontier reasoning and agentic capabilities with exceptional efficiency. We have also open-sourced the training codebase (SteptronOss), with support for continued pretraining, SFT, RL (WIP), and evaluation (WIP), and will open-source the SFT data. Built on a sparse Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token. This "intelligence density" allows it to rival the reasoning depth of top-tier proprietary models, while maintaining the agility required for real-time interaction.
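As a quick check on the sparsity claim, the share of parameters touched per token can be computed directly from the two figures quoted above. This is only arithmetic on the stated numbers; the function and constant names are illustrative and do not come from the model card.

```python
# Figures quoted in the model card, in billions of parameters.
TOTAL_PARAMS_B = 196
ACTIVE_PARAMS_B = 11


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the fraction of parameters activated per token."""
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"{fraction:.1%} of parameters active per token")  # 5.6%
```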

2. Key Capabilities

Deep Reasoning at Speed: While chatbots are built for reading, agents must reason fast. Powered by 3-way Multi-Token Prediction (MTP-3), Step 3.5 Flash achieves a generation throughput of 100–300 tok/s in typical usage (peaking at 350 tok/s for single-stream coding tasks). This allows for complex, multi-step reasoning chains with immediate responsiveness.

A Robust Engine for Coding & Agents: Step 3.5 Flash is purpose-built for agentic tasks, integrating a scalable RL framework that drives consistent self-improvement. It achieves 74.4% on SWE-bench Verified and 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability.

Efficient Long Context: The model supports a cost-efficient 256K context window by employing a 3:1 Sliding Window Attention (SWA) ratio—integrating three SWA layers for every full-attention layer. This hybrid approach ensures consistent performance across massive datasets or long codebases while significantly reducing the computational overhead typical of standard long-context models.
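The 3:1 ratio can be pictured as a repeating layer schedule. The sketch below is a minimal illustration under two assumptions that the model card does not state: that each group places its three SWA layers before the full-attention layer, and an arbitrary depth of 8 layers chosen only to keep the output short.

```python
def attention_pattern(num_layers: int, swa_per_full: int = 3) -> list[str]:
    """Label each layer 'swa' or 'full' for a swa_per_full:1 hybrid schedule.

    Assumption: within each group, the SWA layers come first and the
    full-attention layer closes the group.
    """
    group = swa_per_full + 1
    return ["full" if (i + 1) % group == 0 else "swa" for i in range(num_layers)]


# Hypothetical depth of 8 layers, for illustration only.
pattern = attention_pattern(8)
print(pattern)
print(pattern.count("swa"), "SWA layers,", pattern.count("full"), "full-attention layers")
```

Because only one layer in four attends over the whole sequence, the cost that grows with the full context length applies to a quarter of the layers under this schedule.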

Accessible Local Deployment: Optimized for accessibility, Step 3.5 Flash brings elite-level intelligence to local environments. It runs securely on high-end consumer hardware (e.g., Mac Studio M4 Max, NVIDIA DGX Spark), ensuring data privacy without sacrificing performance.

3. Performance

Step 3.5 Flash delivers performance parity with leading closed-source systems while remaining open and efficient.

![](step-bar-chart.png)

Performance of Step 3.5 Flash measured across Reasoning, Coding, and Agentic Abilities. Open-source models (left) are sorted by their total parameter count, while top-tier proprietary models are shown on the right. xbench-DeepSearch scores are sourced from official publications for consistency. The shadowed bars represent the enhanced performance of Step 3.5 Flash using Parallel Thinking.

Detailed Benchmarks

| Benchmark | # Shots | Step3.5 Flash (Base) | MiMo‑V2 Flash (Base) | GLM‑4.5 (Base) | DeepSeek V3.1 (Base) | DeepSeek V3.2 (Exp Base) | Kimi‑K2 (Base) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| # Activated Params | - | 11B | 15B | 32B | 37B | 37B | 32B |
| # Total Params | - | 196B | 309B | 355B | 671B | 671B | 1043B |
| General | | | | | | | |
| BBH | 3-shot | 88.2 | 88.5 | 86.2 | 88.2† | 88.7† | 88.7 |
| MMLU | 5-shot | 85.8 | 86.7 | 86.1 | 87.4† | 87.8† | 87.8 |
| MMLU‑Redux | 5-shot | 89.2 | 90.6 | - | 90.0† | 90.4† | 90.2 |
| MMLU‑Pro | 5-shot | 62.3 | 73.2 | - | 58.8† | 62.1† | 69.2 |
| HellaSwag | 10-shot | 90.2 | 88.5 | 87.1 | 89.2† | 89.4† | 94.6 |
| WinoGrande | 5-shot | 79.1 | 83.8 | - | 85.9† | 85.6† | 85.3 |
| GPQA | 5-shot | 41.7 | 43.5* | 33.5* | 43.1* | 37.3* | 43.1* |
| SuperGPQA | 5-shot | 41.0 | 41.1 | - | 42.3† | 43.6† | 44.7 |
| SimpleQA | 5-shot | 31.6 | 20.6 | 30.0 | 26.3† | 27.0† | 35.3 |
| Mathematics | | | | | | | |
| GSM8K | 8-shot | 88.2 | 92.3 | 87.6 | 91.4† | 91.1† | 92.1 |
| MATH | 4-shot | 66.8 | 71.0 | 62.6 | 62.6† | 62.5† | 70.2 |
| Code | | | | | | | |
| HumanEval | 3-shot | 81.1 | 77.4* | 79.8* | 72.5* | 67.7* | 84.8* |
| MBPP | 3-shot | 79.4 | 81.0* | 81.6* | 74.6* | 75.6* | 89.0* |
| HumanEval+ | 0-shot | 72.0 | 70.7 | - | 64.6† | 67.7† | - |
| MBPP+ | 0-shot | 70.6 | 71.4 | - | 72.2† | 69.8† | - |
| MultiPL‑E HumanEval | 0-shot | 67.7 | 59.5 | - | 45.9† | 45.7† | 60.5 |
| MultiPL‑E MBPP | 0-shot | 58.0 | 56.7 | - | 52.5† | 50.6† | 58.8 |
| Chinese | | | | | | | |
| C‑EVAL | 5-shot | 89.6 | 87.9 | 86.9 | 90.0† | 91.0† | 92.5 |
| CMMLU | 5-shot | 88.9 | 87.4 | - | 88.8† | 88.9† | 90.9 |
| C‑SimpleQA | 5-shot | 63.2 | 61.5 | 70.1 | 70.9† | 68.0† | 77.6 |

1. “*” denotes cases where the original score was unavailable; we report results evaluated under the same test conditions as Step3.5 Flash for fair comparison.
2. “†” indicates DeepSeek scores quoted from the MiMo‑V2‑Flash report.

Recommended Inference Parameters

1. For the general chat domain, we suggest: temperature=0.6, top_p=0.95.
2. For reasoning / agent scenarios, we recommend: temperature=1.0, top_p=0.95.
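One way to keep these two presets in code is a small lookup table. This is a sketch, not part of the model card: the scenario keys `"chat"` and `"reasoning"` and the helper `sampling_params` are hypothetical names, and the returned dictionary would still need to be passed to whichever inference stack is in use under the parameter names it expects.

```python
# Sampling presets taken from the recommendations above.
# The scenario keys are hypothetical labels chosen for this sketch.
RECOMMENDED_SAMPLING = {
    "chat": {"temperature": 0.6, "top_p": 0.95},
    "reasoning": {"temperature": 1.0, "top_p": 0.95},
}


def sampling_params(scenario: str) -> dict:
    """Return a copy of the recommended sampling settings for a scenario."""
    if scenario not in RECOMMENDED_SAMPLING:
        known = ", ".join(sorted(RECOMMENDED_SAMPLING))
        raise ValueError(f"unknown scenario {scenario!r}; expected one of: {known}")
    # Copy so callers can adjust values without mutating the shared presets.
    return dict(RECOMMENDED_SAMPLING[scenario])


print(sampling_params("chat"))
print(sampling_params("reasoning"))
```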

4. Architecture Details

Step 3.5 Flash is built on a Sparse Mixture-of-Experts (MoE) transformer architecture, optimized for high throughput and low VRAM usage during inference.

4.1 Technical Specifications

4.2 Mixture of Experts (MoE) Routing

Unlike traditional dense models, Step 3.5 Flash uses a fine-grained routing strategy to maximize efficiency:

Fine-Grained Experts: 288 routed experts per layer + 1 shared expert (always active).
Sparse Activation: Only the Top-8 experts are selected per token.
Result: The model retains the "memory" of a 196B parameter model but executes with the speed of an 11B model.
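The routing step described above can be sketched in a few lines. The expert count (288) and Top-8 selection come from the model card; everything else is an assumption made for illustration: the router scores are random stand-ins, and normalizing the selected scores with a softmax is a common choice in MoE layers, not a confirmed detail of Step 3.5 Flash's gating function.

```python
import math
import random

NUM_ROUTED_EXPERTS = 288  # routed experts per layer (from the model card)
TOP_K = 8                 # experts selected per token (from the model card)


def route_token(router_logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Select the top_k experts and return (expert_index, weight) pairs.

    Assumption: weights are a softmax over the selected logits only.
    """
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Random stand-in for the router's per-expert scores for one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]

selected = route_token(logits)
print(f"{len(selected)} of {NUM_ROUTED_EXPERTS} routed experts used "
      f"({len(selected) / NUM_ROUTED_EXPERTS:.1%}), plus the shared expert")
```

The shared expert is not part of the selection because it runs for every token regardless of the router's scores.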

4.3 Multi-Token Prediction (MTP)

To improve inference speed, we utilize a specialized MTP Head consisting of a sliding-window attention mechanism and a dense Feed-Forward Network (FFN). This module predicts 4 tokens simultaneously in a single forward pass, significantly accelerating inference without degrading quality.
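To get a feel for what 4 tokens per forward pass means, the count of forward passes can be compared against one-token-at-a-time decoding. This is an idealized upper bound that assumes every predicted token is kept; the model card does not state how often that holds in practice, and the helper name is illustrative.

```python
import math

TOKENS_PER_PASS = 4  # tokens predicted per forward pass (from the model card)


def forward_passes(num_tokens: int, tokens_per_pass: int = 1) -> int:
    """Forward passes needed to emit num_tokens, assuming every predicted token is kept."""
    return math.ceil(num_tokens / tokens_per_pass)


baseline = forward_passes(1000)
with_mtp = forward_passes(1000, TOKENS_PER_PASS)
print(baseline, with_mtp)  # 1000 250
```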

5. Training Codebase

The training codebase for Step 3.5 Flash is available at SteptronOss.

📜 Citation

If you find this project useful in your research,...

Notability

notability 3.0/10

Low traction model release