What does this writing signal mean?

InclusionAI (Ant Group) published Ming-Omni-TTS: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control. This writing signal gives public context for research themes, product direction, policy, or launch framing. High-signal details: Notable unified TTS model with control. onlylabs links this event to 1 captured evidence page and 6 related writing signals.

InclusionAI (Ant Group) Writing: Ming-Omni-TTS: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control

Captured source

source ↗

inclusion-ai.org/inclusion-ai.org/blog/ming-omni-tts

published Mar 4, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

The Introduction Video of Ming-Omni-TTS

🚀 Featured Abilities

Ming-omni-tts is a high-performance unified audio generation model that offers precise control over speech attributes and synthesizes speech, environmental sounds, and music in a single channel. Powered by a custom 12.5Hz continuous tokenizer and Patch-by-Patch compression, it runs LLM inference at a frame rate of 3.1Hz, which keeps inference cost competitive. The model also features robust text normalization for accurate and natural narration of complex mathematical and chemical expressions.

🔊 Fine-grained Vocal Control: Enables precise control over speech rate, pitch, volume, emotion, and dialects via simple instructions. It achieves 93% accuracy for Cantonese and 46.7% for emotional control, outperforming CosyVoice3.

🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the Instruct-TTS-Eval-zh benchmark is on par with Qwen3-TTS.

🎶 Immersive Unified Generation: The industry's first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.

⚡ High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail.

🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.
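The frame-rate figures above can be related with a little arithmetic. The following is a minimal sketch, not the model's implementation: the post states only the 12.5Hz tokenizer rate and the 3.1Hz LLM rate, so the patch size of 4 frames used here is an assumption chosen because 12.5 / 4 = 3.125, which is consistent with the quoted 3.1Hz. The function names are hypothetical.

```python
import math

TOKENIZER_HZ = 12.5  # continuous tokenizer frame rate, as stated in the post
PATCH_SIZE = 4       # ASSUMPTION: frames grouped per patch; not stated in the post


def tokenizer_frames(duration_s: float) -> int:
    """Number of tokenizer frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * TOKENIZER_HZ)


def llm_steps(duration_s: float, patch_size: int = PATCH_SIZE) -> int:
    """Number of LLM steps if each step covers one patch of `patch_size` frames."""
    return math.ceil(tokenizer_frames(duration_s) / patch_size)


def effective_llm_rate(patch_size: int = PATCH_SIZE) -> float:
    """LLM steps per second of audio implied by the patch size."""
    return TOKENIZER_HZ / patch_size


if __name__ == "__main__":
    seconds = 60
    print("tokenizer frames:", tokenizer_frames(seconds))  # 750
    print("LLM steps:", llm_steps(seconds))                # 188
    print("effective LLM rate (Hz):", effective_llm_rate())  # 3.125
```

Under this assumption, one minute of audio needs 188 LLM steps instead of 750, which is the kind of reduction that makes long, podcast-style generation cheaper.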

Model Structure

Ming-omni-tts is a unified audio language model for the generation of speech, music, and sound, based on a unified continuous audio tokenizer.

Unified Continuous Audio Tokenizer.

Unified Audio Language Model for Speech, Music and Sound Generation.

Benchmark Evaluations

Voice Control: Supports Structured and Natural-Language Instructions

Basic Attribute Control: Speed, Volume, and Pitch Control for Voice Generation

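[Audio columns omitted: Input Prompt, TTS Result]

Target Text | Instruction 1 | Instruction 2
Navigation started. The full route is twenty-five kilometers and is expected to take twelve minutes. | Speech rate: slow | Speech rate: fast
Shrouded in misty rain, the mountains rise around the water and the water flows around the mountains. | Speak a little slower | Speak a little faster
The shared mobility market is currently in a phase of rapid growth. | Volume: low | Volume: high
Beijing performs well in travel scale and urban influence. | Keep the volume as low as possible | Make the volume as high as possible
They took off their heavy winter clothes and walked with straight backs and light steps. | Pitch (F0): low | Pitch (F0): high
Autonomous driving will greatly improve travel safety and efficiency. | Lower the pitch a little | Raise the pitch a little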
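The structured instructions in this demo are short "attribute: value" strings, written in Chinese on the source page (for example 语速：慢速, "speech rate: slow"). The sketch below shows one way such strings could be assembled. The helper name and the English keys are hypothetical; only the Chinese labels and values are taken from the demo table, and nothing here is the model's actual API.

```python
# Chinese labels and values as they appear in the demo table on the source page.
LABELS = {"speech_rate": "语速", "volume": "音量", "pitch": "基频"}
VALUES = {
    "speech_rate": {"slow": "慢速", "fast": "快速"},
    "volume": {"low": "低", "high": "高"},
    "pitch": {"low": "低", "high": "高"},
}


def structured_instruction(attribute: str, level: str) -> str:
    """Build a structured control string such as '语速：慢速' (hypothetical helper)."""
    if attribute not in LABELS:
        raise ValueError(f"unknown attribute: {attribute!r}")
    if level not in VALUES[attribute]:
        raise ValueError(f"unknown level for {attribute}: {level!r}")
    # The demo strings use a full-width colon between label and value.
    return f"{LABELS[attribute]}：{VALUES[attribute][level]}"


if __name__ == "__main__":
    print(structured_instruction("speech_rate", "slow"))
    print(structured_instruction("pitch", "high"))
```

The natural-language variants in the table (such as "Speak a little slower") are free text and need no such template.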

Same Dialect/Cross-Dialect Control: Generating Cantonese and Sichuanese from Mandarin and Native Prompts

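[Audio columns omitted: Input Prompt, TTS Result]

Instruction | Conversion Type | Target Text
Dialect: Cantonese | Cantonese -> Cantonese | He has a big head and no brains, and a big brain that only grows weeds.
Dialect: Cantonese | Cantonese -> Cantonese | This weekend everything in the store is up to 50% off. Stock is limited, and once it is sold out it is gone.
Please say this in Cantonese | Cantonese -> Cantonese | I think society, businesses, and individuals all share the responsibility.
Say it in Cantonese, the more authentic the better. | Mandarin -> Cantonese | You came to visit me and I am really touched. It has been so long since I last saw you!
Express this in a colloquial Cantonese style. | Mandarin -> Cantonese | Hurry up, stop dragging your feet, everyone is waiting for you to start the meeting.
Dialect: Sichuanese | Sichuanese -> Sichuanese | You have to dress yourself up. How will you know whether it looks good if you don't try it on? Look how stylish our new arrivals are.
Dialect: Sichuanese | Sichuanese -> Sichuanese | Back when 赛尔号 first came out, with the mechanics it had then, it was really a lot of fun.
Please say this in Sichuanese | Sichuanese -> Sichuanese | Hey, what do you feel like eating tonight? Cooking some hot pot would do nicely.
Express this imitating the tone of Sichuanese | Mandarin -> Sichuanese | You know what? I like everything about you, well, except for one thing: I don't like it when people put on an act.
Try reading this aloud with a Sichuanese flavor | Mandarin -> Sichuanese | You even had a computer at home back then. That was already advanced for the time.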

Cross-Emotion Control: Cross-Emotion Synthesis Using a Single Neutral Prompt

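[Audio columns omitted: Input Prompt, TTS Result]

Instruction | Conversion Type | Target Text
Emotion: happy | neutral -> happy | If these examinations are held orally, they may be known colloquially as "orals".
Emotion: angry | neutral -> angry | I'm done arguing with you. You're not worth my time!
Emotion: angry | neutral -> angry | In cities, driving speeds are set by which lane a driver is in.
Emotion: sad | neutral -> sad | Everything has changed. The promises and dreams we once had are shattered. How should I face this?
Emotion: happy | neutral -> happy | But it does not allow for adding new members to interfaces.
Emotion: angry | angry -> angry | Harbour Road is part of the route of the Hong Kong Chinese New Year float parade held every Lunar New Year.
Emotion: sad | sad -> sad | I feel as if I am lost in the dark and will never find the way out.
Emotion: happy | neutral -> happy | I actually managed to grab tickets to Eason Chan's concert! Amazing! I finally get to hear him sing live!
Emotion: sad | sad -> sad | The two of us had a gentlemen's understanding from the start, and everything was agreed. She is the one who broke faith and went back on her word. I'm telling you, I am the victim here.
Sound a bit sadder when saying this. | sad -> sad | Some software developers have also noticed that software metrics have become part of the software development process.
Say this a bit more happily. | happy -> happy | I bought my first mountain bike with my own earnings, a Merida Warrior 500! Go me!
When saying this, be sure to convey happiness. | neutral -> happy | I ran into a teacher I hadn't seen in years at the coffee shop today. He still remembered me, and we talked about so many fun memories.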

Built-in Premium Sounds: Over 100 Built-in, High-Quality Timbres

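[Audio column omitted: TTS Result]

Instruction | Description | Target Text
Clone the speaking style of 灵小甄. | Sales, livestream commerce: a bright, crisp voice with a brisk, energetic pace and a strongly persuasive, approachable tone; the typical livestream-seller style. | The name of this product is "Crazy Rip-Off Beef Balls".
Imitate the style of 灵梦. | Virtual companion: a sugary-sweet girlish voice with a naive, willful tone that captures the playful pleading of someone who wants company. | They think that dropping a few lines of English into a Chinese song is fashionable.
Please imitate the accent of 灵岩. | News, customer service: a clear, formal, and professional voice. | At that time, the relevant land-swap agreement will be signed with the Ministry of National Defense as originally planned.
Clone the speaking style of 灵娇. | Girl next door, female college student, vlogger: a sweet, bright young female voice with a light, lively delivery; vivid and youthful when telling everyday anecdotes, and highly engaging. | The CEO asked who wrote the lyrics and the music of the song 皮皮鲁 had just sung. What a masterpiece.
Clone the speaking style of 妩媚妲己. | Alluring character: a sweet, crisp voice with a light, rising intonation that conveys sensual charm. | The bride was a Russian princess who had come all the way from Finland in a sleigh drawn by six reindeer.
Clone the speaking style of 灵绮木. | A cool, glamorous mature female voice tinged with sharpness and arrogance. | This is its second feature: flexible voice design. You can describe a voice directly in text, such as "the voice of an intellectual female news anchor", and it will generate it for you. If you can't be bothered to think of one, it also has more than a hundred premium built-in voices, covering anime characters, short-video voice-overs, and more!
Clone the speaking style of 灵若虚. | Grandmother persona: a voice full of the warmth and kindness of age, unhurried, conveying contentment with life's small details; very soothing. | (same text as the previous row)
Clone the speaking style of 花小呗. | Child character: a crisp, sweet voice with distinctly childlike features and a light, lively intonation. | (same text as the previous row)
Clone the speaking style of 灵浅忧. | Little boy: a crisp, bright voice full of energy. | The weather is nice today, time to go out and play.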

Voice Design: Zero-Shot Synthesis of Custom Vocal Identities via Natural Language Descriptions

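[Audio column omitted: TTS Result]

Instruction | Target Text
Gender: girl's voice. Pitch: sharp and consistently high. Speech rate: rapid, with a hurried tone. Volume: loud and emotionally full. Age: school-age child. Clarity: clear articulation, forceful pronunciation. Fluency: fluent, with emphatic repetition. Accent: standard Mandarin. Timbre: clear childlike voice, slightly shrill. Emotion: agitated and aggrieved, protesting. Intonation: high and urgent. Personality: impatient and forthright, unwilling to back down. | I was just walking past, and they said I was eavesdropping on purpose and called me a little loudspeaker. Well, I will broadcast anyway, I will, I will.
Gender: male. Pitch: steady male mid-to-low register. Speech rate: unhurried, with natural pauses. Volume: normal conversational volume. Age: middle-aged to elderly male. Clarity: clear articulation, standard pronunciation. Fluency: coherent and natural. Accent: standard Mandarin. Timbre: gentle, slightly weathered. Emotion: full of reluctance and nostalgia, shifting to a calm request. Intonation: exclamatory at first, then requesting. Personality: sentimental and loyal, gentle and candid. | This is what little Tianwang gave me. I could never bring myself to throw it away. Please hand it in for me.
Gender: male voice. Pitch: male mid-to-low range, rising on the opening question. Speech rate: generally fast, urgent but clear. Volume: normal conversational volume, with occasional emphasis. Age: young to middle-aged male. Clarity: clear articulation, standard pronunciation. Fluency: fluent narration, with occasional brief pauses for emphasis. Accent: Mandarin with northern regional features. Timbre: fairly deep, with a slight hoarseness. Emotion: moving from concerned questioning to explanation, slightly urgent. Intonation: rising on the opening question, then declarative. Personality: frank and direct, eager to explain. | Nobody bullied the kid. Reporting to the regiment commander: no one bullied him. It's just that he had gone to call on his master, Xiao Yang, and when he came back he kept going on about ghosts.
Gender: female. Pitch: high female register, rising at sentence ends with emotion. Speech rate: somewhat slow, full of earnestness. Volume: normal, slightly raised when agitated. Age: middle-aged woman. Clarity: clear articulation, slightly tearful. Fluency: fluent overall, slightly slowed by emotion. Accent: standard Mandarin. Timbre: slightly hoarse, carrying sadness. Emotion: sad and anxious, with confusion and pleading. Intonation: wide swings, expressing anxious questioning. Personality: intensely emotional, deeply worried. | Our family has only just managed to get back to this state. You knew it was dangerous, so why did you insist on dragging Shanshan into it?
Tell a fun story without pausing, in a lively child's voice full of joy and excitement. | I have a big brother called Xiao Wang, who can eat rice and drink soup too. Don't be fooled that he carries no weapon: when he talks, he outguns a machine gun. ...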
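The structured voice-design instructions in this demo follow a fixed twelve-field pattern, written on the source page with Chinese labels in the form "label: value. label: value. ...". The sketch below assembles such a description from keyword arguments. It is illustrative only: the function name and English keys are hypothetical, the labels and their order are copied from the demo rows, and whether the model requires all fields or this exact punctuation is not stated in the source.

```python
# (English key, Chinese label) pairs in the order used by the demo rows.
VOICE_FIELDS = [
    ("gender", "性别"), ("pitch", "音高"), ("rate", "语速"), ("volume", "音量"),
    ("age", "年龄"), ("clarity", "清晰度"), ("fluency", "流畅度"), ("accent", "口音"),
    ("timbre", "音色质感"), ("emotion", "情绪"), ("intonation", "语调"),
    ("personality", "性格"),
]


def build_voice_description(**attrs: str) -> str:
    """Join the supplied fields as 'label: value' pairs in demo order (hypothetical helper)."""
    known = {key for key, _ in VOICE_FIELDS}
    unknown = set(attrs) - known
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    parts = [f"{label}: {attrs[key]}" for key, label in VOICE_FIELDS if key in attrs]
    return (". ".join(parts) + ".") if parts else ""


if __name__ == "__main__":
    # Values taken from the fourth demo row (a middle-aged female voice).
    print(build_voice_description(gender="女性", age="中年女性", accent="标准普通话"))
```

The last demo row shows that a single free-form sentence also works as a voice-design instruction, so a template like this is a convenience rather than a requirement.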

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable unified TTS model with control