inclusionAI/Ming-UniAudio
Python
Description: Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
Language: Python
License: MIT
Stars: 448
Forks: 30
Open issues: 8
Created: 2025-09-29T03:23:18Z
Pushed: 2025-11-27T02:51:18Z
Default branch: main
Fork: no
Archived: no
README:
Ming-UniAudio
📝 Technical Report | 🌐 Project Page | 🤗 Hugging Face | 🤖 ModelScope
Table of Contents
- [Introduction](#introduction)
- [Updates](#updates)
- [Key Features](#key-features)
- [Evaluation](#evaluation)
- [Speech Tokenizer](#speech-tokenizer)
- [Speech Understanding](#speech-understanding)
- [Speech Generation](#speech-generation)
- [Speech Editing](#speech-editing)
- [Model & Benchmark Downloads](#model--benchmark-downloads)
- [Environment Preparation](#environment-preparation)
- [Example Usage](#example-usage)
- [SFT](#sft)
- [Citation](#citation)
- [Join Us](#join-us)
Introduction
Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that integrates semantic and acoustic features within an end-to-end model. Based on this tokenizer, we developed a speech language model that balances generation and understanding capabilities. Leveraging this foundational model, which performs strongly in both domains, we further trained a dedicated speech editing model built upon Ming-Lite-Omni. Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio
- 🔥 First Speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any timestamp condition: Ming-UniAudio-Edit
- 🔥 First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark
Updates
- [ ] Support vLLM Inference
- [x] Technical Report
- [x] [ASR & TTS SFT recipes](sft/README.md)
- [x] Streaming TTS
- [x] Ming-UniAudio Blog
Key Features
Compared to other audio-assisted LLMs, Ming-UniAudio features the following key optimizations:
- Unified Continuous Speech Tokenizer: Ming-UniAudio proposes MingTok-Audio, a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and it enables a closed-loop system with LLMs through hierarchical feature representations, which makes it suitable for both understanding and generation tasks.
- Unified Speech Language Model for Generation and Understanding: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-quality speech synthesis.
- Instruction-Guided Free-Form Speech Editing: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.
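The tokenizer bullet above mentions a causal Transformer architecture. As a rough illustration of what "causal" means here, the sketch below builds a lower-triangular attention mask in which each frame may attend only to itself and earlier frames. This is a generic construction, not code from the Ming-UniAudio repository; the function names `causal_mask` and `visible_frames` are hypothetical.

```python
from __future__ import annotations


def causal_mask(num_frames: int) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def visible_frames(mask: list[list[bool]], i: int) -> list[int]:
    """List the frame indices that frame i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


# Print the mask for a 4-frame sequence: 1 = allowed, 0 = blocked.
mask = causal_mask(4)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Because no frame depends on future frames, a model masked this way can in principle process audio incrementally, which is one common motivation for causal designs in streaming settings.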
Evaluation
In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.
Speech Tokenizer
Comparison of reconstruction performance across different acoustic tokenizers. The best results are in bold.
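| System | FrameRate | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio(ours) | 50 | **4.21** | **0.96** | **0.98** | **4.04** | **0.96** | **0.98** |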
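To make the FrameRate column in the tokenizer comparison above more concrete, the sketch below converts a frame rate into the number of frames a tokenizer would emit for a clip of a given length. It assumes the column is expressed in frames per second, which the README does not state explicitly; `frames_for_clip` is a hypothetical helper, and the rates are copied from the comparison.

```python
import math

# Frame rates copied from the tokenizer comparison.
# Assumption: the unit is frames per second (Hz).
FRAME_RATES = {
    "MiMo-Audio-Tokenizer": 25.0,
    "GLM4-Voice-Tokenizer": 12.5,
    "Baichuan-Audio-Tokenizer": 12.5,
    "XY-Tokenizer": 12.5,
    "Mimi": 75.0,
    "XCodec2.0": 50.0,
    "BigCodec": 80.0,
    "MingTok-Audio": 50.0,
}


def frames_for_clip(frame_rate: float, seconds: float) -> int:
    """Number of frames needed to cover a clip, rounding partial frames up."""
    return math.ceil(frame_rate * seconds)


# Sequence length each tokenizer would hand to an LLM for a 10-second clip.
for name, rate in FRAME_RATES.items():
    print(f"{name}: {frames_for_clip(rate, 10.0)} frames per 10 s")
```

A lower frame rate gives the language model a shorter sequence per second of audio, so reconstruction scores are best read alongside this column rather than in isolation.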
Speech Understanding
ASR performance comparison on various audio benchmark datasets. The best results are in bold.
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B(ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |
Context ASR performance comparison on various audio benchmark datasets.
| Task | Model | Speech-English (WER / NE-WER / NE-FNR) | Dialogue-English (WER / NE-WER / NE-FNR) | Speech-Mandarin (WER / NE-WER / NE-FNR) | Dialogue-Mandarin (WER / NE-WER / NE-FNR) |
|---|---|---|---|---|---|
| Understanding Context ASR | Qwen2-Audio | 11.49 / 27.27 / 35.08 | 13.99 / 33.02 / 32.92 | 9.92 / 24.10 / 30.02 | 7.00 / 22.76 / 26.17 |
| | Baichuan-Audio | 7.52 / 5.87 / 4.55 | 5.66 / 10.01 / 3.64 | 2.16 / 6.65 / 2.35 | 2.96 / 11.48 / 3.94 |
| | Kimi-Audio | 2.90 / 6.68 / 8.01 | 4.67 / 13.50 / 11.31 | 1.95 / 11.13 / 15.28 | 2.90 / 15.91 / 16.68 |
| | Baichuan-Omni-1.5 | 8.16 / 7.69 / 6.53 | 9.91 / 14.40 / 5.54 | 2.98 / 8.39 / 4.71 | 5.00 / 16.83 / 7.84 |
| | Qwen2.5-Omni-3B | 3.99 / 7.80 / 9.69 | 4.83 / 14.36 / 12.85 | 2.13 / 10.55 / 14.11 | 3.12 / 15.07 / 15.17 |
| | Qwen2.5-Omni-7B | 3.96 / 7.38 / 8.72 | 5.32 / 11.83 / 9.24 | 1.84 / 9.80 / 12.19 | 2.40 / 14.06 / 13.17 |
| | Ming-UniAudio-16B-A3B-Edit(ours) | 4.00 / 3.56 / 3.69 | 5.34 / 8.73 / 2.53 | 1.58 / 5.98 / 2.40 | 3.04 / 9.50 / 1.48 |
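For readers unfamiliar with the WER figures reported above, the following is a minimal sketch of the standard word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, as a percentage. It is not the evaluation script used for these benchmarks. Whitespace tokenization is an assumption that does not fit Mandarin text well, and NE-WER and NE-FNR are not covered here. The helpers `edit_distance` and `wer` are hypothetical names.

```python
from __future__ import annotations


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent; assumes whitespace-separated tokens."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six words
print(wer("a b", "a x y z"))  # insertions can push WER above 100
```

Because insertions count as errors while the denominator is fixed by the reference length, WER is not capped at 100%, which is how an error rate such as 123.78 can arise.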
Speech Generation
Performance comparison on various audio benchmark datasets. The best results are in bold.
| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | 0.80 | 2.25 | 0.76 |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | 1.39 | - |
Ming-Omni-Lite…
Excerpt shown — open the source for the full document.
Notability
notability 6.0/10: New repo with 448 stars, solid but not breakout