What does this repo signal mean?

InclusionAI (Ant Group) published inclusionAI/MingTok-Audio, a Python repository. A repo signal like this can expose tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo inclusionAI/MingTok-Audio · language Python · new audio repo with moderate community interest. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Captured source

GitHub: github.com/inclusionAI/MingTok-Audio

inclusionAI/MingTok-Audio repository metadata

published Sep 29, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

Language: Python

License: MIT

Stars: 88

Forks: 9

Open issues: 4

Created: 2025-09-29T03:19:13Z

Pushed: 2026-02-24T04:10:19Z

Default branch: main

Fork: no

Archived: no

README:

📝 Technical Report ｜ 📖 Project Page ｜ 🤗 Hugging Face ｜ 🤖 ModelScope

Architecture

Key Features

🚀 First Unified Continuous Speech Tokenizer: The first continuous audio tokenizer to effectively integrate semantic and acoustic features, suitable for both understanding and generation tasks.
🎧 High-Quality Reconstruction: Achieves high-quality audio generation by modeling continuous features with a VAE, minimizing information loss and preserving intricate acoustic textures.
🌐 Convolution-Free Efficiency: Built on a pure causal transformer architecture that completely eliminates convolutional layers for superior efficiency and a simpler design.
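
The "pure causal transformer" feature means each audio frame can attend only to itself and to earlier frames. As a minimal illustration of that constraint (not the repository's implementation), the sketch below builds such a mask in plain Python. The function name `causal_mask` is hypothetical.

```python
def causal_mask(num_frames: int) -> list[list[bool]]:
    """Build a lower-triangular attention mask.

    Entry [i][j] is True when frame i is allowed to attend to frame j,
    which in a causal model means j is at or before i.
    """
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


if __name__ == "__main__":
    # Print the mask for four frames: 'x' = visible, '.' = masked out.
    for row in causal_mask(4):
        print(" ".join("x" if allowed else "." for allowed in row))
```

Each printed row gains one more visible position than the row above it, so the last frame sees the whole sequence while the first sees only itself.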

Installation

pip install -r requirements.txt

Quick start

import torch
import torchaudio

from audio_tokenizer.modeling_audio_vae import AudioVAE

# Load the pretrained tokenizer, move it to the GPU, and switch to inference mode.
model = AudioVAE.from_pretrained('inclusionAI/MingTok-Audio')
model = model.cuda()
model.eval()

# torchaudio.load returns a (channels, num_samples) tensor and the file's sample rate.
waveform, sr = torchaudio.load('data/1089-134686-0000.flac', backend='soundfile')
sample = {'waveform': waveform.cuda(), 'waveform_length': torch.tensor([waveform.size(-1)]).cuda()}

# Encode to continuous latents and decode back to a waveform under bf16 autocast.
with torch.no_grad():
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        latent, frame_num = model.encode_latent(**sample)
        output_waveform = model.decode(latent)

# The reconstruction is written at 16 kHz.
torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=16000)
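
To get a feel for what `frame_num` might look like, the sketch below estimates the number of latent frames for a clip. It relies on two assumptions that the source does not confirm: that the FrameRate of 50 reported for MingTok-Audio in the performance table means 50 frames per second, and that the encoder rounds a partial final frame up. The helper name `approx_latent_frames` is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # sample rate passed to torchaudio.save in the quick start
FRAME_RATE_HZ = 50       # FrameRate listed for MingTok-Audio; the unit (Hz) is assumed


def approx_latent_frames(
    num_samples: int,
    sample_rate: int = SAMPLE_RATE_HZ,
    frame_rate: int = FRAME_RATE_HZ,
) -> int:
    """Estimate how many latent frames cover `num_samples` audio samples.

    Ceiling rounding is an assumption; the real encoder may pad or truncate
    differently.
    """
    samples_per_frame = sample_rate // frame_rate  # 320 samples at 16 kHz / 50 Hz
    return -(-num_samples // samples_per_frame)    # integer ceiling division


if __name__ == "__main__":
    # A 10-second clip at 16 kHz.
    print(approx_latent_frames(10 * SAMPLE_RATE_HZ))
```

Comparing this estimate with the `frame_num` returned by `encode_latent` is a quick way to check whether the assumptions hold for a given checkpoint.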

Performance

Speech reconstruction performance

Speech reconstruction performance comparison on various audio benchmark datasets. The best results are in bold.

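| System | FrameRate | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio (ours) | 50 | **4.21** | **0.96** | **0.98** | **4.04** | **0.96** | **0.98** |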
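
The table does not define SIM. In speech evaluation it commonly denotes the cosine similarity between speaker embeddings of the reference and the reconstructed audio. Assuming that reading, the computation can be sketched as below. The vectors are toy values, not real speaker embeddings, and the function name `cosine_similarity` is illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference_embedding = [0.2, 0.7, 0.1]        # toy stand-in for a speaker embedding
    reconstructed_embedding = [0.25, 0.65, 0.1]  # toy stand-in for the reconstruction
    print(round(cosine_similarity(reference_embedding, reconstructed_embedding), 3))
```

Under this reading, a score near 1.0 means the reconstruction keeps the speaker's identity, which fits the ↑ marker in the column headers.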

The adaptation performance for downstream ASR tasks

Understanding ASR performance comparison on various audio benchmark datasets. The best results are in bold.

| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |

The adaptation performance for downstream TTS tasks

Performance comparison on various audio benchmark datasets. The best results are in bold.

| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | **0.80** | 2.25 | **0.76** |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | **1.39** | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B (ours) | **0.95** | 0.70 | 1.85 | 0.58 |
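
The WER(%) columns report word error rate: the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. A minimal sketch of that standard computation follows. It is not the evaluation script behind the table, and the function name `word_error_rate` is illustrative. For Chinese test sets, evaluations often score characters rather than whitespace-separated words; the source does not say which convention was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref_words)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.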

Acknowledgements

1. We borrowed a lot of code from X-Codec-2.0 for tokenizer training.
2. We thank the OpenAI team for developing the Whisper model and making its weights publicly available.

License and Legal Disclaimer

This code repository is licensed under the [MIT License](./LICENSE), and the Legal Disclaimer is located in the [LEGAL.md file](./LEGAL.md) under the project's root directory.
