RepoStepFunStepFunpublished Feb 11, 2025seen 5d

stepfun-ai/Step-Audio

Python

Open original ↗

Captured source

source ↗
published Feb 11, 2025seen 5dcaptured 11hhttp 200method plain

stepfun-ai/Step-Audio

Language: Python

License: Apache-2.0

Stars: 27

Forks: 1

Open issues: 68

Created: 2025-02-11T05:35:12Z

Pushed: 2026-03-16T03:53:07Z

Default branch: main

Fork: no

Archived: no

README:

中文 &nbsp| &nbsp English&nbsp&nbsp | &nbsp日本語

开发者微信交流群、Developer Group

This repository is no longer maintained, please refer to:

Step-Audio2&Step-Audio2-mini for End-to-end speech conversation

Step-Audio-R1&Step-Audio-R1.1 for Speech Reasoning.

Step-Audio-EditX for Audio Editing.

Step-Audio

🔥🔥🔥 News!!

Table of Contents

1. [Introduction](#1-introduction) 2. [Model Summary](#2-model-summary) 3. [Model Download](#3-model-download) 4. [Model Usage](#4-model-usage) 5. [Benchmark](#5-benchmark) 6. [Online Engine](#6-online-engine) 7. [Examples](#7-examples) 8. [Acknowledgements](#8-acknowledgements) 9. [License Agreement](#9-license-agreement) 10. [Citation](#10-citation)

1. Introduction

Step-Audio is the first production-ready open-source framework for intelligent speech interaction that harmonizes comprehension and generation, supporting multilingual conversations (e.g., Chinese, English, Japanese), emotional tones (e.g., joy/sadness), regional dialects (e.g., Cantonese/Sichuanese), adjustable speech rates, and prosodic styles (e.g., rap). Step-Audio demonstrates four key technical innovations:

  • 130B-Parameter Multimodal Model: A single unified model integrating comprehension and generation capabilities, performing speech recognition, semantic understanding, dialogue, voice cloning, and speech synthesis. We have made the 130B Step-Audio-Chat variant open source.
  • Generative Data Engine: Eliminates traditional TTS's reliance on manual data collection by generating high-quality audio through our 130B-parameter multimodal model. Leverages this data to train and publicly release a resource-efficient Step-Audio-TTS-3B model with enhanced instruction-following capabilities for controllable speech synthesis.
  • Granular Voice Control: Enables precise regulation through instruction-based control design, supporting multiple emotions (anger, joy, sadness), dialects (Cantonese, Sichuanese, etc.), and vocal styles (rap, a cappella humming) to meet diverse speech generation needs.
  • Enhanced Intelligence: Improves agent performance in complex tasks through ToolCall mechanism integration and role-playing enhancements.

2. Model Summary

In Step-Audio, audio streams are tokenized via a dual-codebook framework, combining parallel semantic (16.7Hz, 1024-entry codebook) and acoustic (25Hz, 4096-entry codebook) tokenizers with 2:3 temporal interleaving. A 130B-parameter LLM foundation (Step-1) is further enhanced via audio-contextualized continual pretraining and task-specific post-training, enabling robust cross-modal speech understanding. A hybrid speech decoder combining flow matching with neural vocoding, optimized for real-time waveform generation. A streaming-aware architecture featuring speculative response generation (40\% commit rate) and text-based context management (14:1 compression ratio) for efficient cross-modal alignment. ![Architecture](assets/architecture.png)

2.1 Tokenizer

We implement a token-level interleaving approach to effectively integrate semantic tokenization and acoustic tokenization. The semantic tokenizer employs a codebook size of 1024, while the acoustic tokenizer utilizes a larger codebook size of 4096 to capture finer acoustic details. Given the differing token rates, we establish a temporal alignment ratio of 2:3, where every two semantic tokens are paired with three acoustic tokens.

2.2 Language Model

To enhance Step-Audio’s ability to effectively process speech information and achieve accurate speech-text alignment, we conducted audio continual pretrain-ing based on Step-1, a 130-billion parameter pretrained text-based large language model (LLM).

2.3 Speech Decoder

The speech decoder in Step-Audio serves a critical function in converting discrete speech tokens, which contain both semantic and acoustic information, into continuous time-domain waveforms that represent natural speech. The decoder architecture incorporates a flow matching model and a mel-to-wave vocoder. To optimize the intelligibility and naturalness of the synthesized speech, the speech decoder is trained using a dual-code interleaving approach, ensuring seamless integration of semantic and acoustic features throughout the generation process.

2.4 Real-time Inference Pipeline

To enable real-time interactions, we have designed an optimized inference pipeline. At its core, the Controller module manages state transitions, orchestrates speculative response generation, and ensures seamless coordination between critical subsystems. These subsystems include Voice Activity Detection (VAD) for detecting user speech, the Streaming Audio Tokenizer for processing audio in real-time, the Step-Audio language model and Speech Decoder for processing and generating responses, and the Context Manager for preserving conversational continuity. ![Inference Pipeline](assets/pipeline.png)

2.5 Post training details

In the post-training phase, we conducted task-specific Supervised Fine-Tuning (SFT) for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). For Audio Input Text Output (AQTA) tasks, we implemented SFT using diversified high-quality datasets combined with Reinforcement Learning from Human…

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Low stars, trivial new repo