stepfun-ai/Step-Audio
Language: Python
License: Apache-2.0
Stars: 27
Forks: 1
Open issues: 68
Created: 2025-02-11T05:35:12Z
Pushed: 2026-03-16T03:53:07Z
Default branch: main
Fork: no
Archived: no
README:
This repository is no longer maintained. Please refer to:
- Step-Audio2 & Step-Audio2-mini for end-to-end speech conversation.
- Step-Audio-R1 & Step-Audio-R1.1 for speech reasoning.
- Step-Audio-EditX for audio editing.
Step-Audio
🔥🔥🔥 News!!
- Aug 29, 2025: 👋 We released Step-Audio 2 & Step-Audio 2 mini and their corresponding inference [examples](examples.py). The technical report has also been updated.
- Jun 10, 2025: 👋 We released the technical report of Step-Audio-AQAA.
- Feb 17, 2025: 👋 We released the inference code and model weights of Step-Audio-Chat, Step-Audio-TTS-3B, and Step-Audio-Tokenizer.
- Feb 17, 2025: 👋 We released the multi-turn audio benchmark StepEval-Audio-360.
- Feb 17, 2025: 👋 We released the technical report of Step-Audio.
Table of Contents
1. [Introduction](#1-introduction)
2. [Model Summary](#2-model-summary)
3. [Model Download](#3-model-download)
4. [Model Usage](#4-model-usage)
5. [Benchmark](#5-benchmark)
6. [Online Engine](#6-online-engine)
7. [Examples](#7-examples)
8. [Acknowledgements](#8-acknowledgements)
9. [License Agreement](#9-license-agreement)
10. [Citation](#10-citation)
1. Introduction
Step-Audio is the first production-ready open-source framework for intelligent speech interaction that harmonizes comprehension and generation, supporting multilingual conversations (e.g., Chinese, English, Japanese), emotional tones (e.g., joy/sadness), regional dialects (e.g., Cantonese/Sichuanese), adjustable speech rates, and prosodic styles (e.g., rap). Step-Audio demonstrates four key technical innovations:
- 130B-Parameter Multimodal Model: A single unified model integrating comprehension and generation capabilities, performing speech recognition, semantic understanding, dialogue, voice cloning, and speech synthesis. We have made the 130B Step-Audio-Chat variant open source.
- Generative Data Engine: Eliminates traditional TTS's reliance on manual data collection by generating high-quality audio through our 130B-parameter multimodal model. Leverages this data to train and publicly release a resource-efficient Step-Audio-TTS-3B model with enhanced instruction-following capabilities for controllable speech synthesis.
- Granular Voice Control: Enables precise regulation through instruction-based control design, supporting multiple emotions (anger, joy, sadness), dialects (Cantonese, Sichuanese, etc.), and vocal styles (rap, a cappella humming) to meet diverse speech generation needs.
- Enhanced Intelligence: Improves agent performance in complex tasks through ToolCall mechanism integration and role-playing enhancements.
2. Model Summary
In Step-Audio, audio streams are tokenized via a dual-codebook framework that combines parallel semantic (16.7 Hz, 1024-entry codebook) and acoustic (25 Hz, 4096-entry codebook) tokenizers with 2:3 temporal interleaving. A 130B-parameter LLM foundation (Step-1) is further enhanced via audio-contextualized continual pretraining and task-specific post-training, enabling robust cross-modal speech understanding. A hybrid speech decoder combines flow matching with neural vocoding and is optimized for real-time waveform generation. A streaming-aware architecture features speculative response generation (40% commit rate) and text-based context management (14:1 compression ratio) for efficient cross-modal alignment.
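The token rates quoted above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes the quoted 16.7 Hz is a rounded 50/3 Hz, and it reads the 14:1 compression ratio as audio tokens per text token, which the summary does not spell out. The function names are hypothetical and not part of the repository.

```python
from fractions import Fraction

# Assumption: the quoted 16.7 Hz semantic rate is a rounded 50/3 Hz.
SEMANTIC_HZ = Fraction(50, 3)
ACOUSTIC_HZ = Fraction(25)


def audio_tokens(seconds):
    """Total semantic + acoustic tokens produced for `seconds` of audio."""
    return int((SEMANTIC_HZ + ACOUSTIC_HZ) * seconds)


def text_tokens_estimate(seconds, compression=14):
    """Rough text-token count if history is kept as text at a 14:1 ratio.

    Assumption: the ratio means audio tokens per text token.
    """
    return audio_tokens(seconds) / compression


# The two rates reduce exactly to the 2:3 interleaving ratio.
rate_ratio = SEMANTIC_HZ / ACOUSTIC_HZ

if __name__ == "__main__":
    print(rate_ratio)                 # ratio of semantic to acoustic rate
    print(audio_tokens(6))            # audio tokens in six seconds
    print(text_tokens_estimate(6))    # estimated text tokens for the same span
```

Under these assumptions, keeping conversation history as text rather than as audio tokens shrinks the context by roughly an order of magnitude, which is the motivation the summary gives for text-based context management.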
2.1 Tokenizer
We implement a token-level interleaving approach to effectively integrate semantic tokenization and acoustic tokenization. The semantic tokenizer employs a codebook size of 1024, while the acoustic tokenizer utilizes a larger codebook size of 4096 to capture finer acoustic details. Given the differing token rates, we establish a temporal alignment ratio of 2:3, where every two semantic tokens are paired with three acoustic tokens.
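A minimal sketch of 2:3 token-level interleaving follows. It is not the repository's implementation: the function name `interleave_2_3` and the `("S", id)` / `("A", id)` tagging are assumptions made for illustration, and the real tokenizer may instead map both codebooks into one shared vocabulary.

```python
from itertools import islice


def interleave_2_3(semantic, acoustic):
    """Merge two token streams in repeating groups of 2 semantic + 3 acoustic.

    Each output item is tagged with its source so the streams can be
    separated again. Trailing tokens that do not fill a group are kept.
    """
    sem, aco = iter(semantic), iter(acoustic)
    merged = []
    while True:
        s_group = list(islice(sem, 2))   # next two semantic tokens
        a_group = list(islice(aco, 3))   # next three acoustic tokens
        if not s_group and not a_group:
            break
        merged.extend(("S", tok) for tok in s_group)
        merged.extend(("A", tok) for tok in a_group)
    return merged


if __name__ == "__main__":
    semantic_ids = [1, 2, 3, 4]               # ids in [0, 1024)
    acoustic_ids = [10, 11, 12, 13, 14, 15]   # ids in [0, 4096)
    print(interleave_2_3(semantic_ids, acoustic_ids))
```

Tagging each token with its source keeps the example unambiguous even though the two codebooks have overlapping id ranges.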
2.2 Language Model
To enhance Step-Audio’s ability to effectively process speech information and achieve accurate speech-text alignment, we conducted audio continual pretraining based on Step-1, a 130-billion parameter pretrained text-based large language model (LLM).
2.3 Speech Decoder
The speech decoder in Step-Audio serves a critical function in converting discrete speech tokens, which contain both semantic and acoustic information, into continuous time-domain waveforms that represent natural speech. The decoder architecture incorporates a flow matching model and a mel-to-wave vocoder. To optimize the intelligibility and naturalness of the synthesized speech, the speech decoder is trained using a dual-code interleaving approach, ensuring seamless integration of semantic and acoustic features throughout the generation process.
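To give a feel for the flow matching half of the decoder, here is a toy sampler that integrates a velocity field with Euler steps along a straight-line path. This is a sketch under strong assumptions, not the Step-Audio decoder: `oracle_velocity` stands in for a trained network (which would be conditioned on the interleaved tokens and would not know the target), the vectors are plain Python lists rather than mel-spectrogram frames, and the mel-to-wave vocoder stage is omitted entirely.

```python
def oracle_velocity(x, t, target):
    """Velocity of the straight-line path x_t = (1 - t) * x0 + t * x1.

    Given the current point x_t, the conditional velocity is
    (x1 - x_t) / (1 - t). A real decoder would predict this with a
    neural network instead of reading the target directly.
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def sample(x0, target, steps=10):
    """Euler-integrate dx/dt = v(x, t) from t = 0 towards t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                      # t stays below 1, so no division by zero
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.3, -1.2, 0.8]            # stand-in for a Gaussian noise sample
    mel_frame = [1.0, 0.5, -0.25]       # stand-in for one target mel frame
    print(sample(noise, mel_frame))
```

With an exact velocity field the sampler lands on the target. In practice the network's prediction error and the number of Euler steps trade quality against latency, which matters for real-time generation.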
2.4 Real-time Inference Pipeline
To enable real-time interactions, we have designed an optimized inference pipeline. At its core, the Controller module manages state transitions, orchestrates speculative response generation, and ensures seamless coordination between critical subsystems. These subsystems include Voice Activity Detection (VAD) for detecting user speech, the Streaming Audio Tokenizer for processing audio in real-time, the Step-Audio language model and Speech Decoder for processing and generating responses, and the Context Manager for preserving conversational continuity. 
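The Controller's role can be pictured as a small state machine. The sketch below is hypothetical: the state names, event strings, and the commit/discard bookkeeping are assumptions for illustration, and it assumes that a speculative response is drafted when VAD reports a pause, then committed if the turn really ended or discarded if the user resumes speaking. The repository's actual Controller may differ.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()          # waiting for the user
    LISTENING = auto()     # VAD reports active speech
    SPECULATING = auto()   # pause detected, drafting a response early
    RESPONDING = auto()    # draft committed, audio being played


class Controller:
    def __init__(self):
        self.state = State.IDLE
        self.committed = 0
        self.discarded = 0

    def on_event(self, event):
        """Advance the state machine on a (hypothetical) pipeline event."""
        if event == "speech_start":
            if self.state is State.SPECULATING:
                self.discarded += 1     # user kept talking: drop the draft
            # From RESPONDING this acts as a barge-in.
            self.state = State.LISTENING
        elif event == "pause" and self.state is State.LISTENING:
            self.state = State.SPECULATING
        elif event == "turn_end" and self.state is State.SPECULATING:
            self.committed += 1         # draft becomes the real response
            self.state = State.RESPONDING
        elif event == "response_done" and self.state is State.RESPONDING:
            self.state = State.IDLE
        return self.state

    def commit_rate(self):
        """Fraction of speculative drafts that were actually used."""
        total = self.committed + self.discarded
        return self.committed / total if total else 0.0


if __name__ == "__main__":
    ctrl = Controller()
    for ev in ["speech_start", "pause", "speech_start",
               "pause", "turn_end", "response_done"]:
        print(ev, "->", ctrl.on_event(ev).name)
    print("commit rate:", ctrl.commit_rate())
```

Counting committed versus discarded drafts is one way a figure such as a commit rate could be measured: discarded drafts cost compute, while committed ones hide generation latency behind the pause.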
2.5 Post training details
In the post-training phase, we conducted task-specific Supervised Fine-Tuning (SFT) for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). For Audio Input Text Output (AQTA) tasks, we implemented SFT using diversified high-quality datasets combined with Reinforcement Learning from Human…