Model by StepFun, published May 23, 2026

stepfun-ai/Step-3.7-Flash

task: image-text-to-text; license: apache-2.0; library: transformers; params: 201B; downloads: 50k; likes: 363

[ModelPage]: https://static.stepfun.com/blog/step-3.7-flash/

1. Introduction

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter vision encoder for native image understanding. Engineered for high-frequency production workloads, it activates approximately 11B parameters per token and delivers a throughput of up to 400 tokens per second. Step 3.7 Flash supports a 256k context window and offers three selectable reasoning levels (low, medium, and high) so developers can easily balance speed, cost, and cognitive depth.

We built Step 3.7 Flash for developers who need to scale agentic workflows that combine perception, search, and reasoning. It is designed to handle intensive tasks such as parsing massive financial reports in one pass, running multi-step search loops with cross-source verification, or operating concurrent coding agents in high-throughput pipelines.

2. Capabilities & Performance

Multimodal Perception and Verification

The model delivers top-tier visual intelligence, taking first place on SimpleVQA (Search) with a score of 79.2 and reaching frontier parity on V* (Python) at 95.3. These results reflect strong visual grounding and retrieval-augmented reasoning, not just basic image description. The model accurately processes dense visual interfaces, such as UI wireframes, application GUIs, and data charts, and maps them into structured code. When it encounters an incomplete visual asset, it can independently identify the missing data and run lookups to verify context before returning a conclusion.

Workflow Integrity and Tool Orchestration

Execution reliability is critical for autonomous agents. Step 3.7 Flash leads the ClawEval-1.1 benchmark with a score of 67.1, well ahead of the next closest competitor at 59.8. This result indicates high resistance to adversarial traps and strict adherence to system policies during multi-turn orchestration. Together with scores of 49.5 on Toolathlon and 48.1 on HLE w. Tool, it points to high trajectory integrity: Step 3.7 Flash reliably interacts with external APIs and executes long-horizon workflows without drifting from instructions or violating system constraints.

Code Engineering and Professional Baselines

Step 3.7 Flash is built for live engineering tasks and placed second on SWE-Bench PRO with a score of 56.3. It can independently trace multi-file repositories, isolate bugs from raw issue reports, and generate functional patches that pass automated unit tests. Evaluations such as Terminal-Bench 2.1 (59.5) and GDPVal-AA (45.8) show clear room for improvement relative to the top-scoring models in the cohort, but they establish a dependable baseline for system interactions and structured professional deliverables.

![Step 3.7 Flash benchmark results across General Agent, Agentic Coding, and Multimodal evaluations](assets/benchmarks.png)

3. Pricing

| Token Type | Price |
|---|---|
| Input (cache miss) | $0.20 / M tokens |
| Input (cache hit) | $0.04 / M tokens |
| Output | $1.15 / M tokens |
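
As a worked example of how these rates combine, the sketch below estimates the cost of a single request. The prices are copied from the table; the function name `estimate_cost_usd` and the token counts in the sample workload are illustrative assumptions, not part of StepFun's API or billing documentation.

```python
# Prices in USD per million tokens, copied from the pricing table.
PRICE_PER_M = {
    "input_cache_miss": 0.20,
    "input_cache_hit": 0.04,
    "output": 1.15,
}


def estimate_cost_usd(cache_miss_tokens: int, cache_hit_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    total = (
        cache_miss_tokens * PRICE_PER_M["input_cache_miss"]
        + cache_hit_tokens * PRICE_PER_M["input_cache_hit"]
        + output_tokens * PRICE_PER_M["output"]
    )
    return total / 1_000_000


# Hypothetical workload: 200k uncached input tokens, 50k cached input tokens, 4k output tokens.
cost = estimate_cost_usd(200_000, 50_000, 4_000)
print(f"${cost:.4f}")
```

For this hypothetical workload the three terms are $0.0400, $0.0020, and $0.0046. The output tokens cost slightly more than twice the cached input, even though there are far fewer of them.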

4. Availability, Deployment, and Ecosystem

  • Availability: Step 3.7 Flash is available on the StepFun Open Platform (platform.stepfun.ai for Global, platform.stepfun.com for China), on OpenRouter, and on NVIDIA NIM. StepFun is also partnering with DeepInfra, Fireworks AI, and Modal to expand availability soon.
  • Deployment: Step 3.7 Flash supports flexible deployment across cloud, data center, and local environments. For large-scale production and enterprise use cases, it can be deployed on modern data center infrastructure. For local and workstation scenarios, it can also run on high-memory devices such as NVIDIA DGX Station, AMD Ryzen AI Max+ 395-based systems, and Mac Studio / MacBook Pro devices with at least 128GB unified memory.
  • Ecosystem: Step 3.7 Flash is supported across popular open-source infrastructure for both inference and model development. For inference and serving, developers can use vLLM, SGLang, Hugging Face Transformers, and llama.cpp. For model development and customization workflows, StepFun model support has landed in the NVIDIA NeMo ecosystem, including AutoModel, Megatron Core, and Megatron Bridge. Step 3.7 Flash is also available as an NVIDIA NIM inference microservice for on-prem, cloud, or hybrid deployment.
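
The 128GB figure for local devices can be sanity-checked with rough arithmetic on weight storage. The sketch below uses the 198B total parameter count from the introduction. The bit widths are assumptions for illustration, since this document does not state which precision or quantization local deployments use, and the estimate ignores KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 198e9  # total parameter count stated in the introduction


def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


# Assumed precisions, for illustration only.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: about {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Under these assumptions, 16-bit and 8-bit weights (about 396 GB and 198 GB) would not fit in 128GB of unified memory, while roughly 4-bit weights (about 99 GB) would, with the remainder left for the KV cache and the runtime.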

5. Examples

You can get started with Step 3.7 Flash in minutes using StepFun's API or via other inference providers.

> Pick the right base_url for your region. StepFun operates two regional platforms with separate API hosts. The base_url you pass to the OpenAI client must match the platform where your API key was issued, otherwise requests will be rejected as unauthorized.
>
> - Global: platform.stepfun.ai, base_url=https://api.stepfun.ai/v1
> - China: platform.stepfun.com, base_url=https://api.stepfun.com/v1
>
> To avoid hard-coding the wrong region, the examples below read both the API key and base URL from environment variables. Export them once before running:
>
> ```bash
> export STEP_API_KEY="sk-..."
> export STEP_BASE_URL="https://api.stepfun.ai/v1"  # use https://api.stepfun.com/v1 for the China platform
> ```
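
Because a mistyped base URL produces the same unauthorized error as a wrong-region key, it can help to validate the URL before creating the client. The sketch below is one way to do that. The helper `region_for` is hypothetical and not part of any StepFun SDK, and it can only catch an unrecognized URL, not a key issued on the other platform.

```python
# The two API hosts listed in the note above, mapped to their platform.
KNOWN_BASE_URLS = {
    "https://api.stepfun.ai/v1": "global",
    "https://api.stepfun.com/v1": "china",
}


def region_for(base_url: str) -> str:
    """Return 'global' or 'china' for a known base URL, else raise ValueError."""
    normalized = base_url.strip().rstrip("/")
    try:
        return KNOWN_BASE_URLS[normalized]
    except KeyError:
        raise ValueError(f"Unrecognized STEP_BASE_URL: {base_url!r}") from None


# A trailing slash is tolerated; any other host is rejected.
print(region_for("https://api.stepfun.ai/v1/"))
```

Calling `region_for(os.environ["STEP_BASE_URL"])` before constructing the client turns a typo in the environment variable into an immediate, readable error instead of a rejected request.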

5.1 Chat Example

```python
import os
from openai import OpenAI

# The API key and base URL come from the environment variables exported earlier.
client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url=os.environ["STEP_BASE_URL"],
)

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant provided by StepFun. You are good at Chinese, English, and many other languages, and you can see, think, and act to help users get things done.",
        },
        {
            "role": "user",
            "content": "Introduce StepFun's artificial intelligence capabilities.",
        },
    ],
)

# Prints the full response object, including metadata.
print(completion)
```
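
Printing the whole response object is useful for a first test, but most callers only need the reply text and token counts. The sketch below assumes the endpoint returns the standard chat completion shape that the OpenAI Python SDK exposes (`choices[0].message.content` and `usage`), which the use of the OpenAI client suggests but this document does not confirm. The helper name `summarize` and the stand-in values are illustrative.

```python
from types import SimpleNamespace


def summarize(completion):
    """Return (reply_text, prompt_tokens, completion_tokens) from a chat completion."""
    message = completion.choices[0].message
    usage = completion.usage
    return message.content, usage.prompt_tokens, usage.completion_tokens


# Stand-in object with the same attribute layout as an OpenAI SDK chat completion,
# so the sketch runs without a network call. The values are made up.
fake_completion = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Hello from StepFun."))],
    usage=SimpleNamespace(prompt_tokens=42, completion_tokens=5),
)

text, prompt_tokens, completion_tokens = summarize(fake_completion)
print(text)
print(f"input tokens: {prompt_tokens}, output tokens: {completion_tokens}")
```

With a real response, `summarize(completion)` would replace the final `print(completion)` in the chat example.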

5.2 Text and Image Input Example

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["STEP_API_KEY"],
    base_url=os.environ["STEP_BASE_URL"],
)

completion = client.chat.completions.create(
    model="step-3.7-flash",
    messages=[
        {
            "role": "user",
            # Mixed content: a list of typed parts instead of a plain string.
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        },
    ],
)

print(completion)
```
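
The example above references an image by public URL. For a local file, OpenAI-compatible chat APIs commonly accept a base64 `data:` URL in the same `image_url` field. Whether StepFun's endpoint accepts data URLs is an assumption here and should be checked against the platform documentation. The helper names below are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Build an image content part in the same shape as the example above."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

The dictionary returned by `image_part("photo.jpg")` would take the place of the `image_url` part in the `content` list.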

6. Local Deployment…

