Repo · StepFun · published Jan 31, 2026

stepfun-ai/Step-3.5-Flash

C++



Description: Fast, Sharp & Reliable Agentic Intelligence

Language: C++

License: Apache-2.0

Stars: 2076

Forks: 84

Open issues: 18

Created: 2026-01-31T02:57:03Z

Pushed: 2026-04-03T12:50:32Z

Default branch: main

Fork: no

Archived: no

README:

English | Simplified Chinese

OpenClaw Guide | Claude Code Guide | Roo Code Guide | Local Agent Guide

1. Introduction

Step 3.5 Flash is our most capable open-source foundation model, engineered to deliver frontier reasoning and agentic capabilities with exceptional efficiency. Built on a sparse Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token. This "intelligence density" allows it to rival the reasoning depth of top-tier proprietary models, while maintaining the agility required for real-time interaction.

Contents

  • [Key Capabilities](#2-key-capabilities)
  • [Performance](#3-performance)
  • [Architecture Details](#4-architecture-details)
  • [Quick Start](#5-quick-start)
  • [Local Deployment](#6-local-deployment)
  • [Using Step 3.5 Flash on Agent Platforms](#7-using-step-35-flash-on-agent-platforms)
  • [Cookbooks](#8-cookbooks)
  • [Known Issues and Future Directions](#9-known-issues-and-future-directions)
  • [Co-Developing the Future](#10-co-developing-the-future)
  • [License](#license)

2. Key Capabilities

  • Deep Reasoning at Speed: While chatbots are built for reading, agents must reason fast. Powered by 3-way Multi-Token Prediction (MTP-3), Step 3.5 Flash achieves a generation throughput of 100–300 tok/s in typical usage (peaking at 350 tok/s for single-stream coding tasks). This allows for complex, multi-step reasoning chains with immediate responsiveness.
  • A Robust Engine for Coding & Agents: Step 3.5 Flash is purpose-built for agentic tasks, integrating a scalable RL framework that drives consistent self-improvement. It achieves 74.4% on SWE-bench Verified and 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability.
  • Efficient Long Context: The model supports a cost-efficient 256K context window by employing a 3:1 Sliding Window Attention (SWA) ratio—integrating three SWA layers for every full-attention layer. This hybrid approach ensures consistent performance across massive datasets or long codebases while significantly reducing the computational overhead typical of standard long-context models.
  • Accessible Local Deployment: Optimized for accessibility, Step 3.5 Flash brings elite-level intelligence to local environments. It runs securely on high-end consumer hardware (e.g., Mac Studio M4 Max, NVIDIA DGX Spark), ensuring data privacy without sacrificing performance.
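
The 3:1 SWA-to-full-attention ratio described above can be illustrated with a small sketch. It takes the 45-layer depth from the Technical Specifications table; the repeating SWA, SWA, SWA, full ordering and the 512-token window are illustrative assumptions, not values stated in this README.

```python
from collections import Counter

NUM_LAYERS = 45        # layer count from the Technical Specifications table
SWA_PER_FULL = 3       # 3:1 ratio of SWA layers to full-attention layers
WINDOW = 512           # hypothetical window size, not stated in this README
CONTEXT = 256 * 1024   # 256K-token context window


def layer_types(num_layers: int, swa_per_full: int) -> list[str]:
    """Assumed ordering: every (swa_per_full + 1)-th layer uses full attention."""
    period = swa_per_full + 1
    return ["full" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]


def cached_tokens(types: list[str], context: int, window: int) -> int:
    """Number of tokens kept in the KV cache, summed over all layers."""
    return sum(context if t == "full" else min(window, context) for t in types)


types = layer_types(NUM_LAYERS, SWA_PER_FULL)
counts = Counter(types)
hybrid = cached_tokens(types, CONTEXT, WINDOW)
all_full = NUM_LAYERS * CONTEXT

print(dict(counts))
print(f"KV-cache tokens relative to an all-full-attention stack: {hybrid / all_full:.1%}")
```

Under these assumptions the cache of three out of every four layers is bounded by the window, so the total grows mainly with the full-attention layers.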

3. Performance

Step 3.5 Flash delivers performance parity with leading closed-source systems while remaining open and efficient.

![](assets/step-bar-chart.png)

Performance of Step 3.5 Flash measured across Reasoning, Coding, and Agentic Capabilities. Open-source models (left) are sorted by their total parameter count, while top-tier proprietary models are shown on the right. xbench-DeepSearch scores are sourced from official publications for consistency. The shadowed bars represent the enhanced performance of Step 3.5 Flash using Parallel Thinking.

Detailed Benchmarks

| Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2 Thinking / K2.5 | GLM-4.7 | MiniMax M2.1 | MiMo-V2 Flash |
| --- | --- | --- | --- | --- | --- | --- |
| # Activated Params | 11B | 37B | 32B | 32B | 10B | 15B |
| # Total Params (MoE) | 196B | 671B | 1T | 355B | 230B | 309B |
| Est. decoding cost @ 128K context, Hopper GPU | 1.0x (100 tok/s, MTP-3, EP8) | 6.0x (33 tok/s, MTP-1, EP32) | 18.9x (33 tok/s, no MTP, EP32) | 18.9x (100 tok/s, MTP-3, EP8) | 3.9x (100 tok/s, MTP-3, EP8) | 1.2x (100 tok/s, MTP-3, EP8) |
| | | | **Agent** | | | |
| τ²-Bench | 88.2 | 80.3 (85.2*) | 74.3*/85.4* | 87.4 | 86.6* | 80.3 (84.1*) |
| BrowseComp | 51.6 | 51.4 | 41.5* / 60.6 | 52.0 | 47.4 | 45.4 |
| BrowseComp (w/ Context Manager) | 69.0 | 67.6 | 60.2/74.9 | 67.5 | 62.0 | 58.3 |
| BrowseComp-ZH | 66.9 | 65.0 | 62.3 / 62.3* | 66.6 | 47.8* | 51.2* |
| BrowseComp-ZH (w/ Context Manager) | 73.7 | — | —/— | — | — | — |
| GAIA (no file) | 84.5 | 75.1* | 75.6*/75.9* | 61.9* | 64.3* | 78.2* |
| xbench-DeepSearch (2025.05) | 83.7 | 78.0* | 76.0*/76.7* | 72.0* | 68.7* | 69.3* |
| xbench-DeepSearch (2025.10) | 56.3 | 55.7* | —/40+ | 52.3* | 43.0* | 44.0* |
| ResearchRubrics | 65.3 | 55.8* | 56.2*/59.5* | 62.0* | 60.2* | 54.3* |
| | | | **Reasoning** | | | |
| AIME 2025 | 97.3 | 93.1 | 94.5/96.1 | 95.7 | 83.0 | 94.1 (95.1*) |
| HMMT 2025 (Feb.) | 98.4 | 92.5 | 89.4/95.4 | 97.1 | 71.0* | 84.4 (95.4*) |
| HMMT 2025 (Nov.) | 94.0 | 90.2 | 89.2*/— | 93.5 | 74.3* | 91.0* |
| IMOAnswerBench | 85.4 | 78.3 | 78.6/81.8 | 82.0 | 60.4* | 80.9* |
| | | | **Coding** | | | |
| LiveCodeBench-V6 | 86.4 | 83.3 | 83.1/85.0 | 84.9 | — | 80.6 (81.6*) |
| SWE-bench Verified | 74.4 | 73.1 | 71.3/76.8 | 73.8 | 74.0 | 73.4 |
| Terminal-Bench 2.0 | 51.0 | 46.4 | 35.7*/50.8 | 41.0 | 47.9 | 38.5 |
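
As a reading aid for the two parameter rows of the table above, the following sketch computes the fraction of parameters each model activates per token. The figures are copied from the table; the helper name `activation_ratio` is ours, and the ratio is only a rough sparsity indicator, not the decoding-cost estimate reported in the table.

```python
# (activated params, total params) in billions, copied from the table above
MODELS = {
    "Step 3.5 Flash": (11, 196),
    "DeepSeek V3.2": (37, 671),
    "Kimi K2 Thinking / K2.5": (32, 1000),  # 1T total parameters
    "GLM-4.7": (32, 355),
    "MiniMax M2.1": (10, 230),
    "MiMo-V2 Flash": (15, 309),
}


def activation_ratio(active_b: float, total_b: float) -> float:
    """Fraction of the model's parameters used for each generated token."""
    return active_b / total_b


ratios = {name: activation_ratio(a, t) for name, (a, t) in MODELS.items()}

# Print the models from sparsest to densest activation.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    active, total = MODELS[name]
    print(f"{name:<26} {active:>3}B / {total:>4}B = {ratio:.1%}")
```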

Notes

  • "—" indicates the score is not publicly available or not tested.
  • "*" indicates the original score was inaccessible or lower than our reproduced result, so we report our evaluation under the same test conditions as Step 3.5 Flash to ensure a fair comparison.
  • BrowseComp (with Context Manager): when the effective context length exceeds a predefined threshold, the agent resets the context and restarts the agent loop. (By contrast, Kimi K2.5 and DeepSeek-V3.2 used a discard-all strategy.)
  • In the decoding cost row, the cost is estimated using an approach similar to, but more accurate than, the one in arxiv.org/abs/2507.19427.
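
The Context Manager behaviour described in the notes (reset the context and restart the agent loop once a threshold is exceeded) can be sketched as follows. This is a minimal illustration, not the evaluation harness: `run_with_context_reset`, `step`, and `count_tokens` are hypothetical names, and because the note does not say what is carried across a reset, the sketch simply restarts from the task prompt.

```python
from typing import Callable, List, Optional, Tuple


def run_with_context_reset(
    task: str,
    step: Callable[[List[str]], Optional[str]],
    count_tokens: Callable[[List[str]], int],
    threshold: int,
    max_steps: int = 100,
) -> Tuple[Optional[str], int]:
    """Run an agent loop, resetting the context when it exceeds `threshold`.

    `step` may append to the context and returns a final answer, or None to continue.
    Returns the answer (or None if max_steps is reached) and the number of resets.
    """
    context = [task]
    resets = 0
    for _ in range(max_steps):
        if count_tokens(context) > threshold:
            context = [task]  # assumption: restart from the task prompt only
            resets += 1
        answer = step(context)
        if answer is not None:
            return answer, resets
    return None, resets


# Toy usage: each step adds a 10-word observation and finishes on the 8th call.
calls = {"n": 0}


def toy_step(context: List[str]) -> Optional[str]:
    calls["n"] += 1
    context.append("observation " * 10)
    return "done" if calls["n"] == 8 else None


def toy_count(context: List[str]) -> int:
    return sum(len(message.split()) for message in context)  # whitespace "tokens"


answer, resets = run_with_context_reset("find the answer", toy_step, toy_count, threshold=30)
print(answer, resets)
```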

4. Architecture Details

Step 3.5 Flash is built on a Sparse Mixture-of-Experts (MoE) transformer architecture, optimized for high throughput and low VRAM usage during inference.

4.1 Technical Specifications

| Component | Specification |
| :--- | :--- |
| Backbone | 45-layer Transformer (4,096 hidden dim) |
| Context Window | 256K |
| Vocabulary | 128,896 tokens |
| Total Parameters | 196.81B (196B Backbone + 0.81B Head) |
| Active Parameters | ~11B (per token generation) |
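
To put the parameter counts above in hardware terms, here is a back-of-the-envelope estimate of weight storage at a few numeric precisions. The precisions are illustrative assumptions (this excerpt does not say which formats are shipped), and the figures cover weights only, ignoring the KV cache, activations, and any quantization metadata.

```python
TOTAL_PARAMS = 196.81e9   # total parameters, from the table above
ACTIVE_PARAMS = 11e9      # approximate parameters activated per token

# Illustrative precisions; not a statement about released checkpoints.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision}: about {weight_gb(TOTAL_PARAMS, nbytes):.1f} GB of weights")

print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

All weights must be resident even though only a small fraction is used per token, so the total count drives the memory requirement while the active count drives per-token compute.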

4.2 Mixture of Experts (MoE)…
