StepFun analysis
Thesis
StepFun is a Shanghai-based frontier AI lab executing an unusually broad multimodal strategy—spanning video generation, real-time voice, image editing, 3D asset generation, formal mathematics, GUI agents, and deep research—while releasing the vast majority of its model weights, code, and benchmarks under permissive open-source licenses. The lab's May 2026 release of Step 3.7 Flash, a 198B MoE vision-language model with native image/video input and 256K context E3W1W4, paired with the June 2026 launch of a voice-driven CLI coding assistant P1P2, signals a push toward agentic and real-time multimodal product surfaces. StepFun's pace of open-weight releases—spanning 10B to 321B parameters across text, vision, audio, and image generation—positions it as one of the most prolific open-source contributors among frontier labs, with commercial monetization routed through its API platform and NVIDIA NIM distribution W1W2P6.
Signal desks
Hiring
No cited evidence in this pack.
Forks
- vllm: Forked from `vllm-project/vllm` (Feb 2025). The fork contains a `step-audio` branch, indicating StepFun maintains a customized vLLM backend for its audio models. 45 stars. E53P12
- llama.cpp: Forked from `ggml-org/llama.cpp` (Mar 2026). Supports GGUF quantization serving for Step 3.7 Flash, suggesting investment in local/edge inference compatibility. 6 stars. E60W2
Releases
- Step 3.7 Flash — May 2026. 198B MoE vision-language model (201B params), Apache 2.0. BF16, FP8, NVFP4, GGUF checkpoints. 142K HF downloads, 392 likes. Native image/video input, 256K context, three reasoning levels. Distributed as NVIDIA NIM microservice. E3W1W2W3W4
- Step 3.5 Flash — Feb 2026. Text-generation model (~199B params). 224K HF downloads, 820 likes. Plus base and midtrain checkpoints released Mar 2026. E1E10E16
- Step3 — Jul 2025. 321B MoE multimodal reasoning model (38B active). 156K HF downloads, 166 likes. E5P14
- Step3-VL-10B — Jan 2026. Compact 10B multimodal model, 469K HF downloads, 410 likes. FP8 variant also released. E2E12E35P28
- Step-Audio-R1 / R1.1 / R1.5 — Nov 2025–Apr 2026. Speech reasoning models (33B params). Apache 2.0. E4E7P19
- Step-Audio-EditX — Oct 2025. 3B-parameter RL-based audio editing model. 929 GitHub stars. Multi-language (JP, KR added Nov 2025). E41P18
- Step-Audio2 / mini / mini-Think — Jul–Sep 2025. End-to-end multimodal LLM for audio understanding and speech conversation. 1,460 GitHub stars. E22P12
- Step-Video-T2V / Turbo / TI2V — Feb–Mar 2025. Text-to-video and image-to-video models. T2V: 3,185 stars. TI2V: 375 stars. Both with model weights and tech reports. E18E49P3P4
- Step1X-Edit / v1p2 — Apr 2025–Apr 2026. SOTA open-source image editing. v1: 2,224 stars. v1p2: HF release with diffusers support. Step Image Edit 2 released Apr 2026 (API-only, sub-2s generation). E23E11E25E40P8
- NextStep-1 / 1.1 / Large — Aug 2025–Feb 2026. Autoregressive image generation with continuous tokens (14B params). ICLR 2026 Oral. 689 stars. E9E13E21E24E27E28E38P15
- Step1X-3D — May 2025. High-fidelity textured 3D asset generation. 869 stars. E8E42P10
- GELab-Zero-4B-preview — Nov 2025. Open-source GUI agent with model and infrastructure. 2,191 GitHub stars, MCP server support. E6E31P22
- StepDeepResearch — Nov 2025. Cost-effective end-to-end deep research agent. 561 stars. Latest: Step 3.5 Flash achieves 65.27 on ResearchRubrics. E46P23
- PaCoRe-8B / RLVR-8B — Dec 2025. Parallel coordinated reasoning framework. 94.5% on HMMT 2025, surpassing GPT-5's 93.2%. MIT license. 334 stars. E15E30E36P24
- StepFun-Prover-Preview (7B/32B) — Aug 2025. Formal theorem proving in Lean 4. 70% pass@1 on miniF2F-test (32B). E33E45P16
- StepFun-Formalizer (7B/32B) — Aug 2025. Autoformalization to Lean 4. E32E39P17
- StepMesh — Jul 2025. High-performance C++ communication library for Attention-FFN disaggregation. 367 stars. E50P13
- SteptronOss — Dec 2025. Lightweight training framework for SFT, RLVR, and evaluation. 575 stars. E37P27
- Step-Realtime-CLI v0.1.0 — Jun 2026. Terminal-based AI coding assistant with real-time voice. TypeScript, MIT. E17P1P2
- Step-Realtime-Console — Mar 2025. Svelte-based real-time voice API demo with WaveSurfer visualization. 74 stars. E51P6
- ComfyUI-StepVideo — Mar 2025. Custom ComfyUI nodes for StepVideo TI2V/T2V. 43 stars. E54P7
- Step-Audio-Edit-Benchmark — Nov 2025. Evaluation framework for controllable speech synthesis. 20 stars. E58P21
- InfiniteHBD-Trace — May 2025. SIGCOMM 2025 fault-trace dataset from 400 GPU servers for LLM pretraining. E59P11
- GEBench — Feb 2026. (Python repo, 54 stars). E52
- StepAudio-Skills — Mar 2026. Audio skills for Claw. E57
- stepfunApi-audio-sdk — Nov 2025. Android audio SDK (Kotlin) for TTS/ASR. P20
- Qwen2.5-32B-DialogueReason — May 2025. Dialogue reasoning model based on Qwen2.5. E29
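To make the Lean 4 entries in the list above (StepFun-Prover-Preview and StepFun-Formalizer) concrete, here is a toy illustration of what an autoformalized statement and its machine-checked proof look like. It is not taken from StepFun's models or from miniF2F. The theorem name and the choice of statement are hypothetical, and the sketch assumes a recent Lean 4 toolchain in which the `omega` tactic is available without extra libraries.

```lean
-- Informal statement: "The sum of two even natural numbers is even."
-- One possible formalization (illustrative only): evenness is written
-- as "remainder 0 when divided by 2".
theorem even_add_even (a b : Nat) (ha : a % 2 = 0) (hb : b % 2 = 0) :
    (a + b) % 2 = 0 := by
  -- `omega` decides linear arithmetic goals over Nat, including `% 2`.
  omega
```

A formalizer's job corresponds to producing the `theorem` statement from the informal sentence. A prover's job corresponds to producing the part after `:= by` so that Lean accepts it.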
Talking
- Step 3.7 Flash launch narrative: Blog post and GitHub README emphasize enterprise agentic workflows—"Fast, Sharp & Reliable Agentic Intelligence"—with vision input, tool-use reliability, and multi-level reasoning. Ecosystem support spans vLLM, SGLang, llama.cpp, NVIDIA Nemo (AutoModel, Megatron Core, Megatron Bridge), and NVIDIA NIM. W1W2W4
- Third-party coverage (MarkTechPost): Frames Step 3.7 Flash as targeting "coding agents and search workflows," noting the multimodal MoE architecture and Apache 2.0 licensing with BF16/FP8/NVFP4/GGUF weights. W3
- Third-party coverage (NVIDIA Technical Blog): Highlights enterprise readiness, NVFP4 quantization for reduced memory bandwidth, and 256K context window for financial analysis and concurrent coding agents. W4
- Voice AI narrative (Decrypt): Covers StepAudio 2.5 Realtime as an end-to-end voice model (audio-in, audio-out, no text conversion) with customizable personas and a "soul-level companion" named Xiao Yue. Positions StepFun as competing on voice benchmarks. W5
- Developer tool framing (Step-Realtime-CLI): Positions the CLI as a voice-driven coding assistant, with `step voice` for spoken code editing and `step exec --mode plan` for read-only planning. Dual-region deployment (Mainland China and Overseas) with separate API endpoints. P1W6
- Academic recognition: NextStep-1 received ICLR 2026 Oral P15; InfiniteHBD-Trace accepted at ACM SIGCOMM 2025 P11; multiple arXiv tech reports accompany model releases P3P4P5P8P10P12P14P15P16P17P18P19P23P24P28.
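The coverage above stresses that Step 3.7 Flash ships in BF16, FP8, and NVFP4 variants, with NVFP4 reducing memory bandwidth. A rough back-of-the-envelope sketch shows why the format matters at this scale. Everything here is an assumption rather than a published figure: it uses the nominal 198B parameter count, treats every weight as stored at the stated width, and assumes NVFP4 costs about 4.5 bits per weight (a 4-bit value plus one 8-bit scale per 16-weight block). Real checkpoints may keep some layers at higher precision, and the KV cache and activations are ignored.

```python
# Rough weight-only memory estimate per checkpoint format.
# All constants below are illustrative assumptions, not StepFun figures.

PARAMS = 198_000_000_000  # nominal parameter count for Step 3.7 Flash

BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.5,  # assumed: 4-bit value + one 8-bit scale per 16 weights
}


def weight_gb(params: int, bits_per_weight: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def footprint_table(params: int = PARAMS) -> dict:
    """Map each format name to its estimated weight footprint in GB."""
    return {name: weight_gb(params, bits) for name, bits in BITS_PER_WEIGHT.items()}


if __name__ == "__main__":
    for name, gb in footprint_table().items():
        print(f"{name:>6}: {gb:,.1f} GB")
```

Under these assumptions the weights alone take roughly 396 GB in BF16, 198 GB in FP8, and about 111 GB in NVFP4, which is the difference between a multi-node and a single-node deployment on many GPU configurations.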
Shipping
StepFun's shipping cadence is unusually high and spans modalities. Key shipped artifacts with inspectable evidence:
- Models on Hugging Face: 30+ distinct model repositories released since Feb 2025, all under Apache 2.0 or MIT licenses, covering text (Step3 family: 321B→3.5 Flash→3.7 Flash), vision-language (Step3-VL-10B), audio (Step-Audio, Audio2, Audio-R1, Audio-EditX), video (T2V, TI2V), image generation (NextStep-1, Step1X-Edit, Step1X-3D), reasoning (PaCoRe), and agents (GELab-Zero). E21E24E25E32E33E35E36E45
- Infrastructure software: StepMesh (C++ communication library for attention-FFN disaggregation) P13; SteptronOss (training framework for SFT/RLVR/eval) P27; Step-Realtime-CLI (voice coding tool) P1P2; Step-Realtime-Console (voice API Svelte demo) P6; ComfyUI-StepVideo (custom nodes) P7; Android audio SDK P20.
- Datasets and benchmarks: AndroidDaily (GUI agent dataset) P22; Step-Audio-Edit-Benchmark P21; Step-Video-TI2V-Eval P4; StepEval-Audio-360 P5; GEdit-Bench P8; InfiniteHBD-Trace (SIGCOMM 2025 fault traces) P11; PaCoRe-Train-8k P24; three in-house benchmarks from Step-Audio-R1.5 P19.
- Platform: API available at both platform.stepfun.com (Mainland China) and platform.stepfun.ai (Overseas). Models served via API with NVIDIA NIM distribution for enterprise. P1P6P8P23W1W2
- Notable gaps: No cited evidence of mobile app releases, chat products, or consumer-facing products beyond the API platform and developer tools.
Research themes
StepFun's research agenda clusters around five interconnected themes:
1. Multimodal foundation models at scale: The Step3 family demonstrates a progression from a 321B MoE (38B active) P14 to the 198B Step 3.7 Flash with native vision E3W4 and the compact 10B Step3-VL-10B that "matches open-source models 10-20x its size" P28. Step3 and Step 3.7 Flash are both MoE designs, so only a fraction of the total parameters (38B of 321B for Step3) is active per token, which keeps inference cost down.
2. Real-time speech interaction: Investment has been sustained from Step-Audio (Feb 2025, now deprecated) P5 through Step-Audio2 (end-to-end multimodal LLM for speech) P12 and Step-Audio-R1/R1.5 (speech reasoning with RL) P19 to Step-Audio-EditX (3B RL-based audio editing with emotion/paralinguistic control) P18. The Realtime Console P6, Realtime CLI P1, and Android SDK P20 form the application layer.
3. Visual content generation: Separate tracks for video (T2V diffusion, TI2V with motion control) P3P4, image editing (Step1X-Edit targeting GPT-4o/Gemini parity) P8, autoregressive image generation (NextStep-1 with continuous tokens, ICLR 2026 Oral) P15, and 3D asset generation (Step1X-3D) P10.
4. Reasoning and formal methods: PaCoRe introduces parallel coordinated reasoning that scales test-time compute to ~2M tokens, achieving 94.5% on HMMT 2025 (surpassing GPT-5) P24. StepFun-Prover-Preview achieves 70% pass@1 on miniF2F-test via tool-integrated RL with Lean 4 P16. StepFun-Formalizer tackles autoformalization P17. StepDeepResearch applies agentic reasoning to research tasks P23.
5. Agent infrastructure: GELab-Zero provides a fully open-source GUI agent (model + infrastructure, MCP server support, "no cloud dependencies") P22. Step-Realtime-CLI extends agent capabilities to voice-driven coding P1. SteptronOss supports the RLVR training pipeline that underlies reasoning models P27. StepMesh addresses the infrastructure challenge of attention-FFN disaggregation for serving large models P13.
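The MoE point in the first theme can be made concrete with simple arithmetic. The sketch below uses only the Step3 figures cited above (321B total, 38B active). The function and constant names are hypothetical, and treating per-token compute as roughly proportional to active parameters is a simplifying assumption.

```python
# Share of parameters that participate in each token for an MoE model.
# Figures are the Step3 numbers quoted in this report; names are illustrative.

STEP3_TOTAL_B = 321.0   # total parameters, in billions
STEP3_ACTIVE_B = 38.0   # active parameters per token, in billions


def active_fraction(total_b: float, active_b: float) -> float:
    """Return active/total, validating that the inputs are sensible."""
    if total_b <= 0 or active_b <= 0 or active_b > total_b:
        raise ValueError("expected 0 < active_b <= total_b")
    return active_b / total_b


if __name__ == "__main__":
    frac = active_fraction(STEP3_TOTAL_B, STEP3_ACTIVE_B)
    print(f"Step3 active share: {frac:.1%}")
```

For Step3 this comes to just under 12% of the weights being exercised per token, even though all 321B must still be held in memory for serving.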
Hiring & scaling
No cited evidence in this pack. No open job listings, team descriptions, location expansions, or hiring signals appear in any of the provided sources.
Category implications
Strategy: StepFun is executing a broad-coverage open-source strategy that releases model weights, code, benchmarks, and training frameworks under Apache 2.0/MIT while monetizing through API access (platform.stepfun.com / platform.stepfun.ai) and NVIDIA NIM enterprise distribution W1W2W4. The dual-region operational split (Mainland China vs. Overseas with separate accounts and API endpoints) P1 suggests deliberate compliance architecture for cross-border deployment.
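For teams integrating against both regions, the split described above amounts to a small piece of configuration. The following is a minimal sketch, assuming only the two platform hostnames cited in this report. The `Region` enum, the function names, and the use of `https` are hypothetical, and no API paths are shown because none appear in the cited sources.

```python
from enum import Enum


class Region(Enum):
    """Hypothetical labels for the two deployment regions."""
    MAINLAND_CHINA = "mainland_china"
    OVERSEAS = "overseas"


# Hostnames as cited in this report; accounts are separate per region.
PLATFORM_HOSTS = {
    Region.MAINLAND_CHINA: "platform.stepfun.com",
    Region.OVERSEAS: "platform.stepfun.ai",
}


def platform_host(region: Region) -> str:
    """Return the platform hostname for a region."""
    return PLATFORM_HOSTS[region]


def platform_url(region: Region) -> str:
    """Return the platform root URL (https scheme is an assumption)."""
    return f"https://{platform_host(region)}"


if __name__ == "__main__":
    for region in Region:
        print(region.name, platform_url(region))
```

Keeping the region as an explicit value, rather than a hard-coded URL, makes it harder to accidentally send credentials for one region to the other.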
Infrastructure: The StepMesh library for attention-FFN disaggregation P13, the SteptronOss training framework P27, forks of vLLM and llama.cpp E53E60, and NVIDIA Nemo ecosystem integration (AutoModel, Megatron Core, Megatron Bridge) W1 collectively indicate heavy investment in both training and inference infrastructure. The InfiniteHBD-Trace dataset (SIGCOMM 2025), drawn from 400 GPU servers, indicates that StepFun runs LLM pretraining on clusters of at least that size P11.
Product: The Step-Realtime-CLI release signals a move toward developer-facing voice-coding tools as a product surface P1P2. The Realtime Console P6 and Android SDK P20 suggest platform ambitions in real-time voice. The GELab-Zero GUI agent with MCP server support targets multi-device agent orchestration P22. StepDeepResearch targets the research agent market P23.
Research: The breadth—spanning video, audio, images, 3D, code, formal math, and agents—is unusual for a single lab and suggests a "multimodal everything" research thesis. The ICLR 2026 Oral (NextStep-1) P15 and SIGCOMM 2025 (InfiniteHBD-Trace) P11 acceptances indicate academic credibility. The consistent release of tech reports alongside model weights on arXiv P3P4P5P8P10P12P14P15P16P17P18P19P23P24P28 suggests a deliberate research communication strategy.
GTM: The lab reaches developers through GitHub (about 17,600 cumulative stars across the repos itemized under Traction highlights), Hugging Face (30+ model repositories), academic papers, and NVIDIA's enterprise channel. The coverage from Decrypt W5, MarkTechPost W3, and NVIDIA's blog W4 suggests an active media relations function. The WeChat and Discord community channels P8P12P23 indicate developer community building.
Traction highlights
- GitHub stars: Step-Video-T2V: 3,185 P3; Step1X-Edit: 2,224 P8; GELab-Zero: 2,191 P22; Step 3.5 Flash repo: 2,089 E34; Step-Audio2: 1,460 P12; Step-Audio-EditX: 929 P18; Step1X-3D: 869 P10; NextStep-1: 689 P15; Step-Audio-R1: 673 P19; SteptronOss: 575 P27; StepDeepResearch: 561 P23; Step3: 453 P14; Step3-VL-10B: 407 P28; Step-Video-TI2V: 375 P4; StepMesh: 367 P13; PaCoRe: 334 P24; Step 3.7 Flash repo: 253 E20
- HF downloads: Step3-VL-10B: 469K E2; Step 3.5 Flash: 224K E1; Step3: 156K E5; Step 3.7 Flash: 142K E3; NextStep-1.1: 6,222 E21; GELab-Zero-4B: 639 E6; Step1X-Edit-v1p2: 507 E11; Step-Audio-R1: 435 E7; Step-Audio-R1.1: 260 E4; Step3-VL-10B-FP8: 198 E35; StepFun-Formalizer-7B: 168 E39; Step1X-Edit-v1p1-diffusers: 161 E40; StepFun-Formalizer-32B: 128 E32
- Academic recognition: ICLR 2026 Oral (NextStep-1) P15; SIGCOMM 2025 (InfiniteHBD-Trace) P11
- Press coverage: Decrypt W5, MarkTechPost W3, NVIDIA Technical Blog W4
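The per-repo star counts above can be totaled to check the cumulative GitHub figure quoted in the GTM discussion. The sketch below copies the counts from the GitHub stars bullet. The dictionary keys are shortened repo labels rather than exact repository names, and smaller repos mentioned elsewhere (forks, demos, benchmarks) are left out.

```python
# Star counts copied from the "GitHub stars" bullet in Traction highlights.
# Keys are informal labels, not guaranteed to match repository slugs.
GITHUB_STARS = {
    "Step-Video-T2V": 3185,
    "Step1X-Edit": 2224,
    "GELab-Zero": 2191,
    "Step 3.5 Flash": 2089,
    "Step-Audio2": 1460,
    "Step-Audio-EditX": 929,
    "Step1X-3D": 869,
    "NextStep-1": 689,
    "Step-Audio-R1": 673,
    "SteptronOss": 575,
    "StepDeepResearch": 561,
    "Step3": 453,
    "Step3-VL-10B": 407,
    "Step-Video-TI2V": 375,
    "StepMesh": 367,
    "PaCoRe": 334,
    "Step 3.7 Flash": 253,
}


def total_stars(stars: dict) -> int:
    """Sum star counts across all listed repos."""
    return sum(stars.values())


def top_share(stars: dict, n: int) -> float:
    """Fraction of all stars held by the n most-starred repos."""
    ranked = sorted(stars.values(), reverse=True)
    return sum(ranked[:n]) / total_stars(stars)


if __name__ == "__main__":
    print("total:", total_stars(GITHUB_STARS))
    print(f"top-4 share: {top_share(GITHUB_STARS, 4):.0%}")
```

The listed repos sum to 17,634 stars, with the four largest (Step-Video-T2V, Step1X-Edit, GELab-Zero, and the Step 3.5 Flash repo) accounting for a little over half of that total.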