inclusionAI/Ming-omni-tts
Description: Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control
Language: Python
License: MIT
Stars: 240
Forks: 17
Open issues: 12
Created: 2026-02-11T12:18:15Z
Pushed: 2026-02-26T12:06:53Z
Default branch: main
Fork: no
Archived: no
README:
Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control
🌐Project Page |🤗 Hugging Face| 🤖 ModelScope | 🎮 Gradio Demo-zh | 🎮 Gradio Demo-en | 💬 DingTalk(钉钉)
Table of Contents
- [Introduction](#introduction)
- [Demo](#demo)
- [Updates](#updates)
- [Key Features](#-key-features)
- [Evaluation](#evaluation)
- [Audio Tokenizer](#audio-tokenizer)
- [Speech Controllable Generative Tasks](#speech-controllable-generative-tasks)
- [Audio & BGM Generation](#audio--bgm-generation)
- [Text Normalization](#text-normalization)
- [Model & Benchmark Downloads](#model--benchmark-downloads)
- [Environment Preparation](#environment-preparation)
- [Example Usage](#example-usage)
- [Audio Reconstruction](#audio-reconstruction)
- [Audio Generation](#audio-generative)
- [Citation](#citation)
Introduction
Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5Hz continuous tokenizer and Patch-by-Patch compression, it reduces the LLM inference frame rate to 3.1Hz for competitive inference efficiency. Additionally, the model features robust text normalization capabilities for the accurate and natural narration of complex mathematical and chemical expressions.
🚀 Core Capabilities
- 🔊 Fine-grained Vocal Control: The model supports precise control over speech rate, pitch, volume, emotion, and dialect through simple commands. Notably, its accuracy for Cantonese dialect control is as high as 93%, and its emotion control accuracy reaches 46.7%, surpassing CosyVoice3.
- 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the Instruct-TTS-Eval-zh benchmark is on par with Qwen3-TTS.
- 🎶 Immersive Unified Generation: The industry’s first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
- ⚡ High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail.
- 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.
Demo
https://github.com/user-attachments/assets/eb0e900e-ed5e-40ca-98df-31c244939527
Updates
- [ ] Support vLLM Inference
- [ ] Technical Report
- [x] Ming-omni-tts Blog
🚀 Key Features
Compared to other audio-assisted LLMs, Ming-omni-tts features the following key optimizations:
- Unified Continuous Audio Tokenizer: We propose a continuous VAE-based tokenizer that integrates speech, music, and general audio into a unified latent space with 12.5 Hz frame rate, yielding competitive results across audio reconstruction and various downstream synthesis benchmarks.
- Unified Audio Language Model for Speech, Music and Sound Generation: We present a unified, end-to-end audio language model that employs a single LLM backbone to perform joint generation of speech, music, and general sound. To enhance audio quality, the architecture is augmented with a Diffusion Head. Furthermore, we employ a patch-based generation strategy with a patch size of 4 and a look-back history of 32, enabling an optimal balance between local acoustic detail and long-range structural coherence.
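The frame-rate figures quoted above fit together arithmetically: a 12.5 Hz tokenizer grouped into patches of 4 frames gives 12.5 / 4 = 3.125 patch-level steps per second, which matches the stated 3.1Hz LLM inference rate. The following is a minimal sketch of that arithmetic, not the project's code. The function names are hypothetical, and treating the look-back history of 32 as a count of latent frames is an assumption.

```python
import math

TOKENIZER_HZ = 12.5  # latent frame rate stated in the README
PATCH_SIZE = 4       # latent frames per patch, stated in the README
LOOKBACK = 32        # look-back history; the unit (latent frames) is assumed here


def llm_rate_hz(tokenizer_hz: float = TOKENIZER_HZ, patch_size: int = PATCH_SIZE) -> float:
    """Autoregressive steps per second if each step emits one patch."""
    return tokenizer_hz / patch_size


def llm_steps(duration_s: float, tokenizer_hz: float = TOKENIZER_HZ,
              patch_size: int = PATCH_SIZE) -> int:
    """Patch-level steps needed to cover `duration_s` seconds of audio."""
    frames = math.ceil(duration_s * tokenizer_hz)
    return math.ceil(frames / patch_size)


def to_patches(frames: list, patch_size: int = PATCH_SIZE) -> list:
    """Group a frame sequence into consecutive patches (the last may be short)."""
    return [frames[i:i + patch_size] for i in range(0, len(frames), patch_size)]


print(llm_rate_hz())            # 3.125 steps per second
print(llm_steps(60.0))          # 188 steps for one minute of audio
print(LOOKBACK / TOKENIZER_HZ)  # 2.56 seconds of history, under the frame assumption
```

Under these assumptions, one minute of audio needs 188 LLM steps instead of the 750 that frame-by-frame generation at 12.5 Hz would require.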
Evaluation
- Reconstruction: The 12.5Hz tokenizer supports high-quality reconstruction across speech, music, and sound. Its performance is comparable to existing state-of-the-art methods across key fidelity metrics.
- Dialect Generation: Achieves 96% accuracy on WSYue-TTS-Eval and 86% on WSC-TTS-Eval, outperforming CosyVoice3.
- Emotional Expressiveness: Delivers an average accuracy of 76.7% on CV3-Eval emotional sets and 46.7% on neutral emotion sets, significantly surpassing CosyVoice3-Base (40%) to reach SOTA levels.
- Instruction-based Voice Design: Scores 76.20% on InstructTTS-Eval-ZH. Its instruction-following capability is on par with Qwen3-TTS-VoiceDesign.
- Zero-shot Voice Clone: Exhibits exceptional stability on Seed-tts-eval (Chinese) with a WER of 0.83%, outperforming SeedTTS and GLM-TTS.
- Text Normalization (TN): On internal technical test sets, the model achieves a CER of 1.97% in normalized regions, delivering performance comparable to Gemini-2.5 Pro.
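Several results above are reported as WER (word error rate) or CER (character error rate). Both are edit distances between a reference transcript and an ASR transcript of the generated audio, divided by the reference length. The sketch below shows the standard definition only. It does not reproduce the ASR model, text normalization, or scoring scripts used by the benchmarks, and the function names are illustrative.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the hypothesis
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six words
print(cer("H2O", "H20"))                                    # one substitution out of three characters
```

Because the denominator is the reference length, both rates can exceed 100% when the hypothesis contains many insertions.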
Audio Tokenizer
Speech metrics are evaluated on AISHELL-3 (44.1 kHz, Chinese) and VCTK (44.1 kHz, English). Music metrics are evaluated on MUSDB18 (44.1 kHz) and MUSDB18-HQ (44.1 kHz). Audio metrics are evaluated on AudioCaps.
Speech Controllable Generative Tasks
Zero-shot TTS
Zero-shot speech generation performance comparison on the Seed-TTS test set.
| Model | Institution | seed-tts-eval-zh WER ↓ | seed-tts-eval-zh SIM ↑ | seed-tts-eval-en WER ↓ | seed-tts-eval-en SIM ↑ |
|---|---|---|---|---|---|
| Seed-TTS | BytedanceSpeech | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | College | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | Microsoft | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | College | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | Alibaba | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen3-Omni-30B-A3B | Alibaba | 1.07 | – | 1.39 | – |
| CosyVoice 3-0.5B | Alibaba | 1.16 | 0.780 | 2.02 | 0.718 |
| CosyVoice 3-1.5B | Alibaba | 0.71 | 0.775 | 1.45 | 0.695 |
| Qwen3-TTS-25Hz-0.6B-Base | Alibaba | 1.18 | – | 1.64 | – |
| Qwen3-TTS-25Hz-1.7B-Base | Alibaba | 1.10 | – | 1.49 | – |
| Qwen3-TTS-12Hz-0.6B-Base | Alibaba | 0.92 | – | 1.32 | – |
| Qwen3-TTS-12Hz-1.7B-Base | Alibaba | 0.77 | – | 1.24 | – |
| GLM-TTS | Zhipu AI | 1.03 | 0.761 | 2.23 | 0.672 |
| Ming-Flash-Omni-preview | Ant Group | 0.99 | 0.740 | 1.59 | 0.680 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 0.87 | 0.72 | 2.19 | 0.61 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 0.83 | 0.75 | 2.02 | 0.62 |
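The SIM columns above report speaker similarity, which in zero-shot TTS evaluation is typically the cosine similarity between a speaker embedding of the prompt audio and one of the generated audio. The sketch below shows only that final comparison step. The speaker-embedding model is not specified in this excerpt, so the vectors here are made-up placeholders and the function name is illustrative.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder embeddings; real ones would come from a speaker-verification model.
prompt_embedding = [0.2, 0.7, 0.1, 0.4]
generated_embedding = [0.25, 0.6, 0.05, 0.5]
print(round(cosine_similarity(prompt_embedding, generated_embedding), 3))
```

A value near 1 means the two embeddings point in nearly the same direction, which is why higher SIM is better in the table.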
Speech Attribute Control
Model Institution Instruction success rate wer sim
speech…