stepfun-ai/Step-Audio-EditX
Description: A powerful 3B-parameter, LLM-based reinforcement learning audio editing model that excels at editing emotion, speaking style, and paralinguistics, and features robust zero-shot text-to-speech
Language: Python
License: Apache-2.0
Stars: 929
Forks: 69
Open issues: 37
Created: 2025-10-29T11:54:17Z
Pushed: 2026-04-09T02:27:46Z
Default branch: main
Fork: no
Archived: no
README:
Step-Audio-EditX
🔥🔥🔥 News!!!
- Jan 29, 2026:
  - 🧩 New Model Release:
    - Better performance, with an overall improvement of over 4%.
    - More paralinguistic tags have been added, including `exhale`, `snort`, `inhale`, `chuckle`, `clears throat`, `giggle`.
    - Try it out at StepFun Audio Studio.
  - 💻 We release the SFT, DPO and GRPO training code.
  - 🌟 Training and inference for vLLM are now supported. Thanks to the vLLM team!
- Nov 28, 2025: 🚀 New Model Release: Now supporting `Japanese` and `Korean` languages.
- Nov 23, 2025: 📊 Step-Audio-Edit-Benchmark Released!
- Nov 19, 2025: ⚙️ We release a new version of our model, which supports polyphonic pronunciation control and improves the performance of emotion, speaking style, and paralinguistic editing.
- Nov 12, 2025: 📦 We release the optimized inference code and model weights of Step-Audio-EditX (HuggingFace; ModelScope) and Step-Audio-Tokenizer (HuggingFace; ModelScope).
- Nov 07, 2025: ✨ Demo Page; 🎮 HF Space Playground
- Nov 06, 2025: 👋 We release the technical report of Step-Audio-EditX.
Introduction
We are open-sourcing Step-Audio-EditX, a powerful 3B-parameter LLM-based Reinforcement Learning audio model specialized in expressive and iterative audio editing. It excels at editing emotion, speaking style, and paralinguistics, and also features robust zero-shot text-to-speech (TTS) capabilities.
WeChat developer group
📑 Open-source Plan
- [x] Inference Code
- [x] Online demo (Gradio)
- [x] Step-Audio-Edit-Benchmark
- [x] Model Checkpoints
  - [x] Step-Audio-Tokenizer
  - [x] Step-Audio-EditX
  - [x] Step-Audio-EditX-Int4
- [ ] Training Code
  - [x] SFT training
  - [x] DPO training
  - [x] GRPO training
  - [ ] PPO training
- [ ] ⏳ Feature Support Plan
  - [ ] Editing
    - [x] Polyphone pronunciation control
    - [x] More paralinguistic tags ([Cough, Crying, Stress, etc.])
    - [ ] Filler word removal
  - [ ] Other Languages
    - [x] Japanese, Korean
    - [ ] Arabic, French, Russian, Spanish, etc.
Features
- Zero-Shot TTS
  - Excellent zero-shot TTS cloning for Mandarin, English, Sichuanese, and Cantonese.
  - To use a dialect or another language, add a `[Sichuanese]` / `[Cantonese]` / `[Japanese]` / `[Korean]` tag before your text.
  - 🔥 Polyphone pronunciation control: replace the polyphonic characters with pinyin.
    - [我也想过过过儿过过的生活] -> [我也想guo4guo4guo1儿guo4guo4的生活]
- Emotion and Speaking Style Editing
  - Remarkably effective iterative control over emotions and styles, supporting dozens of options for editing.
    - Emotion Editing: [ *Angry*, *Happy*, *Sad*, *Excited*, *Fearful*, *Surprised*, *Disgusted*, etc. ]
    - Speaking Style Editing: [ *Act_coy*, *Older*, *Child*, *Whisper*, *Serious*, *Generous*, *Exaggerated*, etc. ]
    - More emotions and speaking styles are on the way. Get ready! 🚀
- Paralinguistic Editing
  - Precise control over 10 types of paralinguistic features for more natural, human-like, and expressive synthetic audio.
  - Supported tags:
    - [ *Breathing*, *Laughter*, *Surprise-oh*, *Confirmation-en*, *Uhm*, *Surprise-ah*, *Surprise-wa*, *Sigh*, *Question-ei*, *Dissatisfaction-hnn* ]
- Available Tags
| Category | Tag | Description | Tag | Description |
| --- | --- | --- | --- | --- |
| emotion | happy | Expressing happiness | angry | Expressing anger |
| | sad | Expressing sadness | fear | Expressing fear |
| | surprised | Expressing surprise | confusion | Expressing confusion |
| | empathy | Expressing empathy and understanding | embarrass | Expressing embarrassment |
| | excited | Expressing excitement and enthusiasm | depressed | Expressing a depressed or discouraged mood |
| | admiration | Expressing admiration or respect | coldness | Expressing coldness and indifference |
| | disgusted | Expressing disgust or aversion | humour | Expressing humor or playfulness |
| speaking style | serious | Speaking in a serious or solemn manner | arrogant | Speaking in an arrogant manner |
| | child | Speaking in a childlike manner | older | Speaking in an elderly-sounding manner |
| | girl | Speaking in a light, youthful feminine manner | pure | Speaking in a pure, innocent manner |
| | sister | Speaking in a mature, confident feminine manner | sweet | Speaking in a sweet, lovely manner |
| | exaggerated | Speaking in an exaggerated, dramatic manner | ethereal | Speaking in a soft, airy, dreamy manner |
| | whisper | Speaking in a whispering, very soft manner | generous | Speaking in a hearty, outgoing, and straight-talking manner |
| | recite | Speaking in a clear, well-paced, poetry-reading manner | act_coy | Speaking in a sweet, playful, and endearing manner |
| | warm | Speaking in a warm, friendly manner | shy | Speaking in a shy, timid manner |
| | comfort | Speaking in a comforting, reassuring manner | authority | Speaking in an authoritative, commanding manner |
| | chat | Speaking in a casual, conversational manner | radio | Speaking in a radio-broadcast manner |
| | soulful | Speaking in a heartfelt, deeply emotional manner | gentle | Speaking in a gentle, soft manner |
| | story | Speaking in a narrative, audiobook-style manner | vivid | Speaking in a lively, expressive manner |
| | program | Speaking in a show-host/presenter manner | news | Speaking in a news broadcasting manner |
| | advertising | Speaking in a polished, high-end commercial voiceover manner | roar | Speaking in a loud, deep, roaring manner |
| | murmur | Speaking in a quiet, low manner | shout | Speaking in a loud, sharp, shouting manner |
| | deeply | Speaking in a deep and low-pitched tone | loudly | Speaking in a loud and high-pitched tone |
| paralinguistic | [sigh] | Sighing sound | [inhale] | Inhaling sound |
| | [laugh] | Laughter sound | [chuckle] | Chuckling sound |
| | [exhale] | Exhaling sound | [clears… | |
Excerpt shown — open the source for the full document.
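The two text conventions described under Zero-Shot TTS (a language tag placed before the text, and pinyin substituted for polyphonic characters) can be illustrated with a small text-preparation sketch. This is not code from the repository: the helper names `with_language_tag` and `apply_pinyin` are hypothetical, the tag is assumed to be prepended with no separator, and the pinyin check (lowercase letters followed by a tone digit 1 to 5) is an assumption based on the README's single example.

```python
import re
from typing import Dict, Optional

# Language/dialect tags named in the README's Zero-Shot TTS section.
LANGUAGE_TAGS = ("Sichuanese", "Cantonese", "Japanese", "Korean")

# Assumed pinyin shape: lowercase letters plus a tone digit, e.g. "guo4".
_PINYIN = re.compile(r"[a-zü]+[1-5]")


def with_language_tag(text: str, language: Optional[str] = None) -> str:
    """Prefix `text` with a `[Language]` tag (hypothetical helper).

    With `language=None` the text is returned unchanged.
    """
    if language is None:
        return text
    if language not in LANGUAGE_TAGS:
        raise ValueError(f"unknown language tag: {language!r}")
    # Assumption: the tag goes directly before the text, with no separator.
    return f"[{language}]{text}"


def apply_pinyin(text: str, readings: Dict[int, str]) -> str:
    """Replace the characters at the given indices with pinyin readings.

    `readings` maps a character index in `text` to the pinyin that
    should replace that character (hypothetical helper).
    """
    for index, reading in readings.items():
        if not 0 <= index < len(text):
            raise IndexError(f"index {index} is outside the text")
        if not _PINYIN.fullmatch(reading):
            raise ValueError(f"not a pinyin syllable with tone digit: {reading!r}")
    # Keep every character unless a reading was supplied for its index.
    return "".join(readings.get(i, ch) for i, ch in enumerate(text))


if __name__ == "__main__":
    sentence = "我也想过过过儿过过的生活"
    # Indices 3, 4, 5, 7, 8 are the polyphonic character 过.
    readings = {3: "guo4", 4: "guo4", 5: "guo1", 7: "guo4", 8: "guo4"}
    print(apply_pinyin(sentence, readings))
    print(with_language_tag("hello", "Cantonese"))
```

Indexing the readings by character position keeps the choice of pronunciation explicit for each occurrence, which matters when the same character appears several times with different readings, as in the README's example sentence.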
Notability
notability 6.0/10: Solid new repo with 926 GitHub stars