Repo · InclusionAI (Ant Group) · published Feb 11, 2026

inclusionAI/Ming-omni-tts

Description: Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control

Language: Python

License: MIT

Stars: 240

Forks: 17

Open issues: 12

Created: 2026-02-11T12:18:15Z

Pushed: 2026-02-26T12:06:53Z

Default branch: main

Fork: no

Archived: no

README:

Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control

🌐 Project Page | 🤗 Hugging Face | 🤖 ModelScope | 🎮 Gradio Demo-zh | 🎮 Gradio Demo-en | 💬 DingTalk (钉钉)

Table of Contents

  • [Introduction](#introduction)
  • [Demo](#demo)
  • [Updates](#updates)
  • [Key Features](#-key-features)
  • [Evaluation](#evaluation)
  • [Audio Tokenizer](#audio-tokenizer)
  • [Speech Controllable Generative Tasks](#speech-controllable-generative-tasks)
  • [Audio & BGM Generation](#audio--bgm-generation)
  • [Text Normalization](#text-normalization)
  • [Model & Benchmark Downloads](#model--benchmark-downloads)
  • [Environment Preparation](#environment-preparation)
  • [Example Usage](#example-usage)
  • [Audio Reconstruction](#audio-reconstruction)
  • [Audio Generation](#audio-generative)
  • [Citation](#citation)

Introduction

Ming-omni-tts is a high-performance unified audio generation model that offers precise control over speech attributes and synthesizes speech, environmental sounds, and music in a single channel. Powered by a custom 12.5 Hz continuous tokenizer and Patch-by-Patch compression, it lowers the LLM inference frame rate to 3.1 Hz. The model also provides robust text normalization for accurate, natural narration of complex mathematical and chemical expressions.

🚀 Core Capabilities

  • 🔊 Fine-grained Vocal Control: The model supports precise control over speech rate, pitch, volume, emotion, and dialect through simple commands. Notably, its accuracy for Cantonese dialect control is as high as 93%, and its emotion control accuracy reaches 46.7%, surpassing CosyVoice3.
  • 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the InstructTTS-Eval-ZH benchmark is on par with Qwen3-TTS.
  • 🎶 Immersive Unified Generation: The industry’s first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5 Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
  • High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1 Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail.
  • 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.

Demo

https://github.com/user-attachments/assets/eb0e900e-ed5e-40ca-98df-31c244939527

Updates

🚀 Key Features

Compared with other audio-assisted LLMs, Ming-omni-tts introduces the following key optimizations:

  • Unified Continuous Audio Tokenizer: We propose a continuous VAE-based tokenizer that integrates speech, music, and general audio into a unified latent space at a 12.5 Hz frame rate, yielding competitive results across audio reconstruction and various downstream synthesis benchmarks.
  • Unified Audio Language Model for Speech, Music and Sound Generation: We present a unified, end-to-end audio language model that uses a single LLM backbone to jointly generate speech, music, and general sound. To enhance audio quality, the architecture is augmented with a Diffusion Head. We also employ a patch-based generation strategy with a patch size of 4 and a look-back history of 32, which balances local acoustic detail against long-range structural coherence.
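
The relationship between the 12.5 Hz tokenizer, the patch size of 4, and the 3.1 Hz LLM frame rate quoted elsewhere in this README can be checked with a few lines of arithmetic. The sketch below assumes that one autoregressive LLM step emits exactly one patch of latent frames, which is an interpretation of the text rather than something the README states outright; the function names are hypothetical.

```python
import math

TOKENIZER_RATE_HZ = 12.5  # latent frames per second (from the README)
PATCH_SIZE = 4            # latent frames per patch (from the README)


def llm_step_rate(tokenizer_rate_hz: float, patch_size: int) -> float:
    """LLM steps per second, assuming one step emits one patch."""
    return tokenizer_rate_hz / patch_size


def llm_steps_for(duration_s: float, tokenizer_rate_hz: float, patch_size: int) -> int:
    """Number of LLM steps needed to cover `duration_s` seconds of audio."""
    frames = math.ceil(duration_s * tokenizer_rate_hz)
    return math.ceil(frames / patch_size)


print(f"{llm_step_rate(TOKENIZER_RATE_HZ, PATCH_SIZE):.3f} Hz")  # 3.125 Hz
print(llm_steps_for(60.0, TOKENIZER_RATE_HZ, PATCH_SIZE))        # 188
```

Under this assumption, 12.5 / 4 = 3.125 Hz, which rounds to the quoted 3.1 Hz, and one minute of audio takes roughly 188 LLM steps instead of 750.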

Evaluation

  • Reconstruction: The 12.5 Hz tokenizer supports high-quality reconstruction across speech, music, and sound. Its performance is comparable to existing state-of-the-art methods across key fidelity metrics.
  • Dialect Generation: Achieves 96% accuracy on WSYue-TTS-Eval and 86% on WSC-TTS-Eval, outperforming CosyVoice3.
  • Emotional Expressiveness: Delivers an average accuracy of 76.7% on CV3-Eval emotional sets and 46.7% on neutral emotion sets, significantly surpassing CosyVoice3-Base (40%) to reach SOTA levels.
  • Instruction-based Voice Design: Scores 76.20% on InstructTTS-Eval-ZH. Its instruction-following capability is on par with Qwen3-TTS-VoiceDesign.
  • Zero-shot Voice Clone: Exhibits exceptional stability on seed-tts-eval (Chinese) with a WER of 0.83%, outperforming Seed-TTS and GLM-TTS.
  • Text Normalization (TN): On internal technical test sets, the model achieves a CER of 1.97% in normalized regions, delivering performance comparable to Gemini-2.5 Pro.
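
Several of the figures above are word or character error rates (WER, CER). As a reminder of what those numbers measure, here is a minimal sketch of the conventional definition: the edit distance between a reference transcript and the ASR transcript of the generated audio, divided by the reference length. It is not the benchmarks' own scoring code, which may apply additional text normalization; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a reference token
                curr[j - 1] + 1,          # insert a hypothesis token
                prev[j - 1] + (r != h),   # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring spaces."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
print(cer("kitten", "sitting"))                                       # 0.5
```

CER is the usual choice for Chinese text, where whitespace does not delimit words, while WER is standard for English.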

Audio Tokenizer

Speech metrics are evaluated on AISHELL-3 (44.1 kHz, Chinese) and VCTK (44.1 kHz, English). Music metrics are evaluated on MUSDB18 (44.1 kHz) and MUSDB18-HQ (44.1 kHz). Audio metrics are evaluated on AudioCaps.

Speech Controllable Generative Tasks

Zero-shot TTS

Zero-shot speech generation performance comparison on the Seed-TTS test set.

| Model | Institution | seed-tts-eval-zh WER ↓ | seed-tts-eval-zh SIM ↑ | seed-tts-eval-en WER ↓ | seed-tts-eval-en SIM ↑ |
|---|---|---|---|---|---|
| Seed-TTS | BytedanceSpeech | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | College | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | Microsoft | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | College | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | Alibaba | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen3-Omni-30B-A3B | Alibaba | 1.07 | – | 1.39 | – |
| CosyVoice 3-0.5B | Alibaba | 1.16 | 0.780 | 2.02 | 0.718 |
| CosyVoice 3-1.5B | Alibaba | 0.71 | 0.775 | 1.45 | 0.695 |
| Qwen3-TTS-25Hz-0.6B-Base | Alibaba | 1.18 | – | 1.64 | – |
| Qwen3-TTS-25Hz-1.7B-Base | Alibaba | 1.10 | – | 1.49 | – |
| Qwen3-TTS-12Hz-0.6B-Base | Alibaba | 0.92 | – | 1.32 | – |
| Qwen3-TTS-12Hz-1.7B-Base | Alibaba | 0.77 | – | 1.24 | – |
| GLM-TTS | Zhipu AI | 1.03 | 0.761 | 2.23 | 0.672 |
| Ming-Flash-Omni-preview | Ant Group | 0.99 | 0.740 | 1.59 | 0.680 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.87 | 0.72 | 2.19 | 0.61 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.83 | 0.75 | 2.02 | 0.62 |
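
The SIM ↑ columns report speaker similarity. In zero-shot TTS evaluation this is commonly the cosine similarity between a speaker embedding of the prompt audio and one of the generated audio. This excerpt does not say which embedding model was used, so the sketch below shows only the similarity step on already-extracted embedding vectors. The vectors are toy placeholders, not real measurements.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy placeholder vectors standing in for speaker embeddings.
prompt_embedding = [1.0, 0.0]
generated_embedding = [1.0, 1.0]
print(round(cosine_similarity(prompt_embedding, generated_embedding), 4))  # 0.7071
```

A value closer to 1 means the generated voice is closer to the prompt speaker in the embedding space.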

Speech Attribute Control

Model Institution Instruction success rate wer sim

speech…
