stepfun-ai/Step-Audio2
Python
Captured source
source ↗stepfun-ai/Step-Audio2
Description: Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
Language: Python
License: Apache-2.0
Stars: 1460
Forks: 107
Open issues: 55
Created: 2025-07-15T09:14:32Z
Pushed: 2026-03-16T04:06:21Z
Default branch: main
Fork: no
Archived: no
README:
Step-Audio 2
🔥🔥🔥 News!!
- Sep 15, 2025: 👋 We release Step-Audio 2 mini Think and its corresponding [examples](examples-think.py).
- Sep 3, 2025: 👋 We release our vLLM backend and corresponding [examples](examples-vllm.py).
- Aug 29, 2025: 👋 We are pleased to open-source Step-Audio 2 mini, Step-Audio 2 mini Base and their corresponding inference [examples](examples.py). Technical report is also updated.
- Jul 24, 2025: 👋 We release demonstration videos for Step-Audio 2.
- Jul 23, 2025: 👋 We release our benchmark for paralinguistic information understanding, StepEval-Audio-Paralinguistic.
- Jul 23, 2025: 👋 We release our benchmark for tool calling, StepEval-Audio-Toolcall.
- Jul 23, 2025: 👋 We release the technical report of Step-Audio 2.
WeChat Developer Group
Introduction
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
- Advanced Speech and Audio Understanding: Promising performance in ASR and audio understanding by comprehending and reasoning semantic information, para-linguistic and non-vocal information.
- Intelligent Speech Conversation: Achieving natural and intelligent interactions that are contextually appropriate for various conversational scenarios and paralinguistic information.
- Emotional Reasoning: Analyzing user's paralinguistic information such as age and emotion, leading to more accurate and intelligent interpretation of the audio context.
- Tool Calling and Multimodal RAG: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
- State-of-the-Art Performance: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and Technical Report).
+ Open-source: Step-Audio 2 mini, Step-Audio 2 mini Base and Step-Audio 2 mini Think are released under [Apache 2.0](LICENSE) license.
Model Download
| Models | 🤗 Hugging Face | ModelScope | |-------|-------|-------| | Step-Audio 2 mini | stepfun-ai/Step-Audio-2-mini | stepfun-ai/Step-Audio-2-mini | | Step-Audio 2 mini Base | stepfun-ai/Step-Audio-2-mini-Base | stepfun-ai/Step-Audio-2-mini-Base | | Step-Audio 2 mini Think | stepfun-ai/Step-Audio-2-mini-Think | stepfun-ai/Step-Audio-2-mini-Think |
Model Usage
🔧 Dependencies and Installation
- Python >= 3.10
- PyTorch >= 2.3-cu121
- CUDA Toolkit
conda create -n stepaudio2 python=3.10 conda activate stepaudio2 pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml git clone https://github.com/stepfun-ai/Step-Audio2.git cd Step-Audio2 git lfs install git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini # git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base
🔧 vLLM docker image
We highly recommend using our vLLM backend for faster and streaming inference, also deploying across multiple GPUs.
# (Optional) build the docker image yourself (very slow and requires 32GiB of memory) # docker build -t stepfun2025/vllm:step-audio-2-v20250909 . # run vLLM docker docker run --rm -ti --gpus all \ -v Step-Audio-2-mini:/Step-Audio-2-mini \ -p 8000:8000 \ stepfun2025/vllm:step-audio-2-v20250909 \ -- vllm serve /Step-Audio-2-mini \ --served-model-name step-audio-2-mini \ --port 8000 \ --max-model-len 16384 \ --max-num-seqs 32 \ --tensor-parallel-size 1 \ --enable-auto-tool-choice \ --tool-call-parser step_audio_2 \ --tokenizer-mode step_audio_2 \ --chat_template_content_format string \ --audio-parser step_audio_2_tts_ta4 \ --trust-remote-code
🚀 Inference Scripts
python examples.py # python examples-base.py # python examples-vllm.py # python examples-think.py
🚀 Local web demonstration
pip install gradio python web_demo.py # python web_demo_vllm.py
Online demonstration
StepFun realtime console
- Both Step-Audio 2 and Step-Audio 2 mini are available in our StepFun realtime console with web search tool enabled.
- You will need an API key from the StepFun Open Platform.
StepFun AI Assistant
- Step-Audio 2 is also available in our StepFun AI Assistant mobile App with both web and audio search tools enabled.
- Please scan the following QR code to download it from your app store then tap the phone icon in the top-right corner.
WeChat group
You can scan the following QR code to join our WeChat group for communication and discussion.
Evaluation
Automatic speech recognition
CER for Chinese, Cantonese and Japanese and WER for Arabian and English. N/A indicates that the language is…
Excerpt shown — open the source for the full document.
Notability
notability 7.0/10Notable audio model repo with 1.46k stars