Model · StepFun · published Feb 1, 2026 · seen 5d

stepfun-ai/Step-3.5-Flash

published Feb 1, 2026 · seen 5d · captured 9h · http 200 · method plain · task text-generation · license apache-2.0 · library transformers · params 199B · downloads 326k · likes 820

Step 3.5 Flash

1. Introduction

Step 3.5 Flash is our most capable open-source foundation model, engineered to deliver frontier reasoning and agentic capabilities with exceptional efficiency. Built on a sparse Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token. This "intelligence density" allows it to rival the reasoning depth of top-tier proprietary models, while maintaining the agility required for real-time interaction.

2. Key Capabilities

  • Deep Reasoning at Speed: While chatbots are built for reading, agents must reason fast. Powered by 3-way Multi-Token Prediction (MTP-3), Step 3.5 Flash achieves a generation throughput of 100–300 tok/s in typical usage (peaking at 350 tok/s for single-stream coding tasks). This allows for complex, multi-step reasoning chains with immediate responsiveness.
  • A Robust Engine for Coding & Agents: Step 3.5 Flash is purpose-built for agentic tasks, integrating a scalable RL framework that drives consistent self-improvement. It achieves 74.4% on SWE-bench Verified and 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability.
  • Efficient Long Context: The model supports a cost-efficient 256K context window by employing a 3:1 Sliding Window Attention (SWA) ratio—integrating three SWA layers for every full-attention layer. This hybrid approach ensures consistent performance across massive datasets or long codebases while significantly reducing the computational overhead typical of standard long-context models.
  • Accessible Local Deployment: Optimized for accessibility, Step 3.5 Flash brings elite-level intelligence to local environments. It runs securely on high-end consumer hardware (e.g., Mac Studio M4 Max, NVIDIA DGX Spark), ensuring data privacy without sacrificing performance.
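
The 3:1 SWA ratio from the Efficient Long Context item above can be illustrated with a small layer-schedule sketch. The source states only the ratio, so the function name, the 12-layer toy depth, and the choice to place the full-attention layer at the end of each group of four are assumptions made for illustration.

```python
def attention_schedule(num_layers: int, swa_per_full: int = 3) -> list[str]:
    """Label each layer 'swa' or 'full', with `swa_per_full` SWA layers per full layer."""
    period = swa_per_full + 1
    # Assumption: the full-attention layer closes each group of `period` layers.
    # The source gives the 3:1 ratio but not the position within the group.
    return ["full" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]


schedule = attention_schedule(12)  # toy depth, not the model's real layer count
print(schedule)
print(schedule.count("swa"), "SWA layers,", schedule.count("full"), "full-attention layers")
```

Because a sliding-window layer attends only to a fixed-size window, its key/value cache does not grow with the full context length, which is where the savings of a hybrid schedule like this come from.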

3. Performance

Step 3.5 Flash delivers performance parity with leading closed-source systems while remaining open and efficient.

![](step-bar-chart.png)

Performance of Step 3.5 Flash measured across Reasoning, Coding, and Agentic Abilities. Open-source models (left) are sorted by their total parameter count, while top-tier proprietary models are shown on the right. xbench-DeepSearch scores are sourced from official publications for consistency. The shadowed bars represent the enhanced performance of Step 3.5 Flash using Parallel Thinking.

Detailed Benchmarks

| Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2 Thinking / K2.5 | GLM-4.7 | MiniMax M2.1 | MiMo-V2 Flash |
| --- | --- | --- | --- | --- | --- | --- |
| # Activated Params | 11B | 37B | 32B | 32B | 10B | 15B |
| # Total Params (MoE) | 196B | 671B | 1T | 355B | 230B | 309B |
| Est. decoding cost @ 128K context, Hopper GPU | 1.0x (100 tok/s, MTP-3, EP8) | 6.0x (33 tok/s, MTP-1, EP32) | 18.9x (33 tok/s, no MTP, EP32) | 18.9x (100 tok/s, MTP-3, EP8) | 3.9x (100 tok/s, MTP-3, EP8) | 1.2x (100 tok/s, MTP-3, EP8) |
| | | | Agent | | | |
| τ²-Bench | 88.2 | 80.3 (85.2*) | 74.3*/85.4* | 87.4 | 86.6* | 80.3 (84.1*) |
| BrowseComp | 51.6 | 51.4 | 41.5* / 60.6 | 52.0 | 47.4 | 45.4 |
| BrowseComp (w/ Context Manager) | 69.0 | 67.6 | 60.2/74.9 | 67.5 | 62.0 | 58.3 |
| BrowseComp-ZH | 66.9 | 65.0 | 62.3 / 62.3* | 66.6 | 47.8* | 51.2* |
| BrowseComp-ZH (w/ Context Manager) | 73.7 | — | —/— | — | — | — |
| GAIA (no file) | 84.5 | 75.1* | 75.6*/75.9* | 61.9* | 64.3* | 78.2* |
| xbench-DeepSearch (2025.05) | 83.7 | 78.0* | 76.0*/76.7* | 72.0* | 68.7* | 69.3* |
| xbench-DeepSearch (2025.10) | 56.3 | 55.7* | —/40+ | 52.3* | 43.0* | 44.0* |
| ResearchRubrics | 65.3 | 55.8* | 56.2*/59.5* | 62.0* | 60.2* | 54.3* |
| | | | Reasoning | | | |
| AIME 2025 | 97.3 | 93.1 | 94.5/96.1 | 95.7 | 83.0 | 94.1 (95.1*) |
| HMMT 2025 (Feb.) | 98.4 | 92.5 | 89.4/95.4 | 97.1 | 71.0* | 84.4 (95.4*) |
| HMMT 2025 (Nov.) | 94.0 | 90.2 | 89.2*/— | 93.5 | 74.3* | 91.0* |
| IMOAnswerBench | 85.4 | 78.3 | 78.6/81.8 | 82.0 | 60.4* | 80.9* |
| | | | Coding | | | |
| LiveCodeBench-V6 | 86.4 | 83.3 | 83.1/85.0 | 84.9 | — | 80.6 (81.6*) |
| SWE-bench Verified | 74.4 | 73.1 | 71.3/76.8 | 73.8 | 74.0 | 73.4 |
| Terminal-Bench 2.0 | 51.0 | 46.4 | 35.7*/50.8 | 41.0 | 47.9 | 38.5 |

Notes:
1. "—" indicates the score is not publicly available or not tested.
2. "*" indicates the original score was inaccessible or lower than our reproduced result, so we report the score evaluated under the same test conditions as Step 3.5 Flash to ensure fair comparability.
3. BrowseComp (with Context Manager): When the effective context length exceeds a predefined threshold, the agent resets the context and restarts the agent loop. By contrast, Kimi K2.5 and DeepSeek-V3.2 used a "discard-all" strategy.
4. Decoding Cost: Estimates are based on a methodology similar to, but more accurate than, the approach described in arxiv.org/abs/2507.19427.
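
The context-reset behaviour described in the BrowseComp (with Context Manager) note can be sketched as a simple loop. This is a minimal illustration and not the evaluation harness itself: the function name, the callback signatures, the character-based token count, and the choice to keep only the task prompt after a reset are all assumptions, since the source does not say what survives a reset.

```python
from typing import Callable


def run_with_context_reset(
    task: str,
    step: Callable[[list[str]], str],
    count_tokens: Callable[[list[str]], int],
    max_context_tokens: int,
    max_steps: int,
) -> tuple[list[str], int]:
    """Run an agent loop, resetting the context whenever it exceeds the threshold."""
    context = [task]
    resets = 0
    for _ in range(max_steps):
        if count_tokens(context) > max_context_tokens:
            # Assumption: a reset keeps only the task prompt and restarts from there.
            context = [task]
            resets += 1
        context.append(step(context))
    return context, resets


# Toy usage: every step emits a 10-character observation; "tokens" are characters.
context, resets = run_with_context_reset(
    task="find X",
    step=lambda ctx: "o" * 10,
    count_tokens=lambda ctx: sum(len(m) for m in ctx),
    max_context_tokens=30,
    max_steps=10,
)
print(resets, context)
```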

Recommended Inference Parameters

1. For the general chat domain, we suggest temperature=0.6, top_p=0.95.
2. For reasoning and agent scenarios, we recommend temperature=1.0, top_p=0.95.
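
One way to keep these recommendations in a single place is a small lookup helper. The sketch below only encodes the values listed above; the dictionary name, the scenario keys, and the helper function are hypothetical and not part of any StepFun API.

```python
# Recommended sampling settings, copied from the list above.
# The scenario keys ("chat", "reasoning", "agent") are illustrative names.
RECOMMENDED_SAMPLING = {
    "chat": {"temperature": 0.6, "top_p": 0.95},
    "reasoning": {"temperature": 1.0, "top_p": 0.95},
    "agent": {"temperature": 1.0, "top_p": 0.95},
}


def sampling_params(scenario: str) -> dict:
    """Return a copy of the recommended sampling settings for a scenario."""
    try:
        return dict(RECOMMENDED_SAMPLING[scenario])
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None


params = sampling_params("chat")
print(params)
```

`temperature` and `top_p` are the usual keyword names in the Hugging Face `transformers` `generate` method (together with `do_sample=True`) and in OpenAI-compatible chat endpoints, so the returned dictionary can typically be passed through as keyword arguments.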

4. Architecture Details

Step 3.5 Flash is built on a Sparse Mixture-of-Experts (MoE) transformer architecture, optimized for high throughput and low VRAM usage during inference.

4.1 Technical Specifications

| Component | Specification |
| :--- | :--- |
| Backbone | 45-layer Transformer (4,096 hidden dim) |
| Context Window | 256K |
| Vocabulary | 128,896 tokens |
| Total Parameters | 196.81B (196B Backbone + 0.81B Head) |
| Active Parameters | ~11B (per token generation) |
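
As a quick check of the parameter figures in the table, the arithmetic can be worked through directly. The variable names are illustrative, and the active fraction uses the approximate ~11B figure, so the result is approximate as well.

```python
backbone_b = 196.0  # backbone parameters in billions (from the table)
head_b = 0.81       # head parameters in billions (from the table)
active_b = 11.0     # approximate active parameters per token (from the table)

total_b = backbone_b + head_b
active_fraction = active_b / total_b

print(f"total = {total_b:.2f}B, active fraction = {active_fraction:.1%}")
# prints: total = 196.81B, active fraction = 5.6%
```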

4.2 Mixture of Experts (MoE) Routing

Unlike traditional dense models, Step 3.5 Flash uses a fine-grained routing strategy to maximize efficiency:

  • Fine-Grained Experts: 288 routed experts per layer + 1 shared expert (always active).
  • Sparse Activation: Only the Top-8 experts are selected per token.
  • Result: The model retains the "memory" of a 196B parameter model but executes with the speed of an 11B model.
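
The routing pattern above (288 routed experts, Top-8 selection, one always-active shared expert) can be sketched in plain Python. This is a toy illustration under stated assumptions rather than the model's implementation: the gate weights are taken to be a softmax over the selected logits only, the experts are scalar functions, and all function names are hypothetical.

```python
import math
import random

NUM_ROUTED_EXPERTS = 288  # from the text
TOP_K = 8                 # from the text


def route(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Pick the top-k routed experts and return {expert_index: gate_weight}."""
    ranked = sorted(range(len(router_logits)), key=router_logits.__getitem__, reverse=True)
    top = ranked[:top_k]
    # Assumption: gate weights are a softmax over the selected logits only.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


def moe_output(x: float, router_logits: list[float], routed, shared) -> float:
    """Add the always-active shared expert to the gated top-k routed experts."""
    gates = route(router_logits)
    return shared(x) + sum(w * routed[i](x) for i, w in gates.items())


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
gates = route(logits)

# Toy scalar "experts": routed expert i multiplies its input by (i + 1).
routed = [lambda x, s=i + 1: s * x for i in range(NUM_ROUTED_EXPERTS)]
y = moe_output(1.0, logits, routed, shared=lambda x: x)
print(len(gates), "routed experts used out of", NUM_ROUTED_EXPERTS)
```

Only 8 of the 288 routed experts run for a given token, which is why per-token compute tracks the active parameter count rather than the total.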

4.3 Multi-Token Prediction (MTP)

To improve inference speed, we utilize a specialized MTP Head consisting of a sliding-window attention mechanism and a dense Feed-Forward Network (FFN). This module predicts 4 tokens simultaneously in a single forward pass, significantly…

Excerpt shown — open the source for the full document.

Notability

notability 8.0/10

High HF downloads indicate strong community traction