Repo: InclusionAI (Ant Group), published Sep 29, 2025

inclusionAI/Ming-UniAudio

Python


Description: Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

Language: Python

License: MIT

Stars: 448

Forks: 30

Open issues: 8

Created: 2025-09-29T03:23:18Z

Pushed: 2025-11-27T02:51:18Z

Default branch: main

Fork: no

Archived: no

README:

Ming-UniAudio

📝 Technical Report | 🌐 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Table of Contents

  • [Introduction](#introduction)
  • [Updates](#updates)
  • [Key Features](#key-features)
  • [Evaluation](#evaluation)
  • [Speech Tokenizer](#speech-tokenizer)
  • [Speech Understanding](#speech-understanding)
  • [Speech Generation](#speech-generation)
  • [Speech Editing](#speech-editing)
  • [Model & Benchmark Downloads](#model--benchmark-downloads)
  • [Environment Preparation](#environment-preparation)
  • [Example Usage](#example-usage)
  • [SFT](#sft)
  • [Citation](#citation)
  • [Join Us](#join-us)

Introduction

Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. At its core is a unified continuous speech tokenizer that integrates semantic and acoustic features within an end-to-end model. On top of this tokenizer, we developed a speech language model that balances generation and understanding capabilities. Building on this foundation model, which performs robustly in both domains, we further trained a dedicated speech editing model built upon Ming-Lite-Omni. Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.

  • 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio
  • 🔥 First Speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio
  • 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any timestamp condition: Ming-UniAudio-Edit
  • 🔥 First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark

Updates

Key Features

Compared with other audio-assisted LLMs, Ming-UniAudio features the following key optimizations:

  • Unified Continuous Speech Tokenizer: Ming-UniAudio proposes MingTok-Audio, a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and it enables a closed-loop system with LLMs through hierarchical feature representations, which makes it suitable for both understanding and generation tasks.
  • Unified Speech Language Model for Generation and Understanding: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-quality speech synthesis.
  • Instruction-Guided Free-Form Speech Editing: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.

Evaluation

In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.

Speech Tokenizer

Comparison of reconstruction performance across different acoustic tokenizers. The best results are in bold.

| System | FrameRate | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio (ours) | 50 | **4.21** | **0.96** | **0.98** | **4.04** | **0.96** | **0.98** |
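To make the FrameRate column more concrete, the sketch below converts a clip duration into the number of frames each tokenizer would emit. It assumes FrameRate means frames per second (Hz), which the table does not state explicitly. The helper name `frames_for` is hypothetical and is not part of the repository. The count covers frames only; discrete tokenizers with several codebooks may emit more than one token per frame.

```python
import math

# Frame rates copied from the table above.
# Assumption: the unit is frames per second (Hz).
FRAME_RATE_HZ = {
    "MiMo-Audio-Tokenizer": 25.0,
    "GLM4-Voice-Tokenizer": 12.5,
    "Baichuan-Audio-Tokenizer": 12.5,
    "XY-Tokenizer": 12.5,
    "Mimi": 75.0,
    "XCodec2.0": 50.0,
    "BigCodec": 80.0,
    "MingTok-Audio": 50.0,
}


def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Return the number of frames needed to cover `duration_s` seconds.

    Hypothetical helper for illustration; rounds up so that a partial
    trailing frame is still counted.
    """
    if duration_s < 0 or frame_rate_hz <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    return math.ceil(duration_s * frame_rate_hz)


if __name__ == "__main__":
    # Sequence length an LLM would see for a 10-second clip, per tokenizer.
    for name, rate in FRAME_RATE_HZ.items():
        print(f"{name}: {frames_for(10.0, rate)} frames")
```

Under this assumption, a 10-second clip becomes 500 frames with MingTok-Audio and 125 frames with a 12.5 Hz tokenizer, so the higher reconstruction scores come with a longer sequence for the language model to process.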

Speech Understanding

ASR performance comparison on various audio benchmark datasets. The best results are in bold.

| Datasets | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |
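The excerpt does not name the metric behind these ASR numbers. Assuming they are word or character error rates in percent, as is common for ASR benchmarks, the following minimal sketch shows how such a rate is computed from an edit distance. The function names are illustrative and do not come from the repository's evaluation scripts, which may apply additional text normalization.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Error rate in percent: 100 * edits / number of reference units.

    unit="word" splits on whitespace (WER); unit="char" compares
    characters with spaces removed (CER, often used for Chinese).
    """
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(error_rate("the cat sat", "the bat sat"))
    print(error_rate("hi", "hi there you all"))
```

Because insertions count as errors while the denominator is the reference length, the rate can exceed 100%. Under the stated assumption, that is how a score such as 123.78 on Minnan can arise.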

Context ASR performance comparison on various audio benchmark datasets. Each cell lists WER / NE-WER / NE-FNR.

| Datasets | Model | Speech-English | Dialogue-English | Speech-Mandarin | Dialogue-Mandarin |
|---|---|---|---|---|---|
| Understanding Context ASR | Qwen2-Audio | 11.49 / 27.27 / 35.08 | 13.99 / 33.02 / 32.92 | 9.92 / 24.10 / 30.02 | 7.00 / 22.76 / 26.17 |
| | Baichuan-Audio | 7.52 / 5.87 / 4.55 | 5.66 / 10.01 / 3.64 | 2.16 / 6.65 / 2.35 | 2.96 / 11.48 / 3.94 |
| | Kimi-Audio | 2.90 / 6.68 / 8.01 | 4.67 / 13.50 / 11.31 | 1.95 / 11.13 / 15.28 | 2.90 / 15.91 / 16.68 |
| | Baichuan-Omni-1.5 | 8.16 / 7.69 / 6.53 | 9.91 / 14.40 / 5.54 | 2.98 / 8.39 / 4.71 | 5.00 / 16.83 / 7.84 |
| | Qwen2.5-Omni-3B | 3.99 / 7.80 / 9.69 | 4.83 / 14.36 / 12.85 | 2.13 / 10.55 / 14.11 | 3.12 / 15.07 / 15.17 |
| | Qwen2.5-Omni-7B | 3.96 / 7.38 / 8.72 | 5.32 / 11.83 / 9.24 | 1.84 / 9.80 / 12.19 | 2.40 / 14.06 / 13.17 |
| | Ming-UniAudio-16B-A3B-Edit (ours) | 4.00 / 3.56 / 3.69 | 5.34 / 8.73 / 2.53 | 1.58 / 5.98 / 2.40 | 3.04 / 9.50 / 1.48 |
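The excerpt does not define NE-WER or NE-FNR. As a rough illustration only, the sketch below assumes NE-FNR is the percentage of reference named entities that do not appear in the recognized text. The function `ne_fnr` and its case-insensitive substring matching are assumptions made for this example. The benchmark's official scoring may align and normalize text differently.

```python
def ne_fnr(reference_entities: list[str], hypothesis: str) -> float:
    """Named-entity false negative rate in percent.

    Assumed definition: 100 * (reference entities missing from the
    hypothesis) / (total reference entities). Matching here is a
    simple case-insensitive substring check, chosen for brevity.
    """
    if not reference_entities:
        raise ValueError("at least one reference entity is required")
    hyp = hypothesis.lower()
    missed = sum(1 for entity in reference_entities if entity.lower() not in hyp)
    return 100.0 * missed / len(reference_entities)


if __name__ == "__main__":
    entities = ["Ant Group", "Hangzhou"]
    transcript = "and group opened an office in hangzhou"
    # One of the two entities ("Ant Group") was misrecognized.
    print(ne_fnr(entities, transcript))
```

Under this reading, a model can reach a low overall WER while still missing many entities, which may explain why the three numbers in each cell do not move together.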

Speech Generation

Performance comparison on various audio benchmark datasets. The best results are in bold.

| Datasets | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | 0.80 | 2.25 | 0.76 |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | 1.39 | - |

Ming-Omni-Lite…
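The SIM columns are not defined in this excerpt. Assuming SIM is the cosine similarity between speaker embeddings of the reference (or prompt) audio and the generated or reconstructed audio, which is a common convention in TTS evaluation, the computation reduces to the sketch below. The embedding vectors are placeholders; the speaker-verification model that would produce them is not specified here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Placeholder embeddings; real ones would come from a speaker encoder.
    prompt_embedding = [0.2, 0.7, 0.1]
    generated_embedding = [0.25, 0.65, 0.05]
    print(round(cosine_similarity(prompt_embedding, generated_embedding), 3))
```

A value near 1 means the two embeddings point in almost the same direction, which under this assumption indicates that the speaker identity was preserved.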

