OpenBMB/UltraEval-Audio
Description: Your faithful, impartial partner for audio evaluation: genuine evaluation that helps you know yourself and know your rivals.
Language: Python
License: Apache-2.0
Stars: 303
Forks: 24
Open issues: 3
Created: 2024-11-11T09:41:43Z
Pushed: 2026-06-10T07:12:04Z
Default branch: main
Fork: no
Archived: no
README: 
A Unified Framework for Comprehensive Evaluation of Audio Foundation Models
Chinese | English | 💬Discord | UltraEval-Audio Paper
v1.1 Highlights
> - Popular model replication: Added replication support for popular models, including replication result showcases and one-click replication commands (see replication/).
> - Isolated Runtime: Introduced an isolated inference mechanism. Model-specific dependencies are installed/managed automatically; inference runs in the isolated environment and communicates with the main evaluation process via IPC, eliminating dependency conflicts.
> - Specialized model evaluation support: Added specialized audio models for TTS, ASR, and Audio Codec, further expanding evaluation coverage.
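
The isolated-runtime idea can be illustrated with a minimal sketch: the main process launches a worker under a (possibly different) Python interpreter and exchanges JSON lines over pipes. This is not UltraEval-Audio's implementation; `IsolatedWorker`, `WORKER_SOURCE`, and the message fields are hypothetical names chosen for illustration, and the "model" simply upper-cases its prompt.

```python
import json
import subprocess
import sys

# Hypothetical worker script: stands in for a model process that would
# normally run inside its own environment with its own dependencies.
WORKER_SOURCE = r'''
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    # Placeholder "inference": echo the prompt in upper case.
    reply = {"id": request["id"], "text": request["prompt"].upper()}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
'''


class IsolatedWorker:
    """Talks to a child process over stdin/stdout using one JSON object per line."""

    def __init__(self, python_executable: str = sys.executable):
        # Assumption: in a real setup, python_executable would point at the
        # interpreter of a separate virtual environment for the model.
        self.proc = subprocess.Popen(
            [python_executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes
        )

    def infer(self, request_id: int, prompt: str) -> dict:
        # Send one request line, then block until one reply line arrives.
        self.proc.stdin.write(json.dumps({"id": request_id, "prompt": prompt}) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def close(self) -> None:
        # Closing stdin ends the worker's read loop, so it exits cleanly.
        self.proc.stdin.close()
        self.proc.wait(timeout=10)


if __name__ == "__main__":
    worker = IsolatedWorker()
    print(worker.infer(1, "hello"))
    worker.close()
```

Because the two processes share only a pipe, the evaluator and the model never import each other's packages, which is the property that avoids dependency conflicts.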
Overview
🚀Exceptional Experience with UltraEval-Audio🚀
UltraEval-Audio is the world's first open-source framework supporting both speech understanding and speech generation evaluation, designed specifically for large audio models. It aggregates 34 authoritative benchmarks covering four major domains (speech, sound, medicine, and music) and supports 10 languages and 12 task categories. UltraEval-Audio offers:
- Direct Replication of Popular Models 🔬: Provides detailed [replication documentation and commands](./replication/), ensuring you can easily reproduce evaluation results of open-source models with complete transparency and reproducibility.
- One-Click Benchmark Management 📥: Say goodbye to tedious manual downloading and data processing. UltraEval-Audio automates it all, letting you easily acquire well-known benchmark datasets (e.g., Librispeech, TED-LIUM, Seed-TTS-Eval).
- Built-in Evaluation Tools ⚙️: No need to hunt for evaluation tools. UltraEval-Audio binds datasets with commonly used official evaluation methods (e.g., WER, WER-ZH, BLEU, G-Eval) to ensure alignment between datasets and metrics.
- Powerful and Flexible 🛠️: Supports preview testing, random sampling, error retries, and resume-from-breakpoint, ensuring a flexible and controllable evaluation process while boosting efficiency and accuracy.
- Seamless Integration of Custom Datasets 💼: Supports not only public benchmarks but also powerful custom dataset integration, allowing rapid application in various engineering scenarios.
- Easy Integration with Existing Systems 🔗: With excellent extensibility and standardized design, UltraEval-Audio seamlessly connects with your existing evaluation pipelines, simplifying project management and unifying output results.
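
To make the WER metric mentioned above concrete, here is a minimal sketch of word error rate as word-level edit distance divided by reference length. It is an illustration only, not the framework's built-in scorer: the function name `word_error_rate` is hypothetical, and the lower-casing and whitespace tokenization are simplifying assumptions (official scoring tools usually apply their own text normalization).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # Assumption: lower-casing and whitespace splitting are enough normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, since the denominator is the reference length only.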

Changelog🔥
- [2026/06/10]
  - Support [Qwen3-ASR](replication/qwen3_asr.md) evaluation (`qwen3-asr-1.7b`, `qwen3-asr-0.6b`), with replication results and commands for English, Chinese, and Chinese dialect ASR benchmarks.
- [2026/04/20]
  - Support [Fish Speech S2 Pro](replication/fishaudio-s2-pro.md) evaluation, including Seed-TTS-Eval and MiniMax multilingual TTS benchmarks (22 languages)
- [2026/02/03]
  - Support [Qwen3-TTS](replication/qwen3_tts.md) evaluation
  - GPU parallel acceleration for faster evaluation/inference
    - Usage: add `--use_model_pool` and `--workers` to enable multi-GPU parallel inference, e.g. `python audio_evals/main.py --dataset --model --use_model_pool --workers 4`
- [2026/01/19]
  - Support Step-Audio-R1.1 evaluation, with replication report: [Step-Audio-R1.1](replication/step-audio-r1_1.md)
- [2025/12/31]
  - Release v1.1 🎉🎉🎉
  - Add replication docs for popular models: [CosyVoice2](replication/CosyVoice2.md), [CosyVoice3](replication/CosyVoice3.md), [GLM-TTS](replication/GLM-TTS.md), [IndexTTS2](replication/IndexTTS2.md), [VoxCPM](replication/VoxCPM.md)
  - Support Isolated Runtime offline inference
  - Support task-specific audio models for TTS, ASR, and Audio Codec
- [2025/12/04]
  - Support [Qwen3-Omni](replication/qwen3_omni.md), update [Kimi-Audio](replication/kimi-audio.md)
- [2025/12/02]
  - 🌟 Added [Replication Results and Command Documentation](./replication/): To better support the open-source community, we have documented the evaluation process and results of current open-source models in detail, so that the evaluation process is fully transparent and reproducible.
  - Support [Long-TTS-Eval](registry/dataset/long-tts-eval.yaml) dataset, see alignment details in [Long-TTS-Eval](./replication/Long-TTS-Eval.md)
  - Support [MGM-Omni TTS](registry/model/mgm_omni.yaml) model, see alignment details in [MGM-Omni](./replication/MGM-Omni.md)
- [2025/10/30]
  - Support VoxCPM TTS model: `--model voxcpm-tts`, `--model voxcpm-vc`
  - Use `uv` to accelerate model dependency installation 🚀
- [2025/10/17]
  - [Support seed-tts-eval dataset](docs/seed-tts-eval4voice_clone.md)
- [2025/05/22]
  - Use audio quality metrics
- [2025/05/12]
  - Support Qwen2.5-Omni (`qwen2.5-omni-audio`, `qwen2.5-omni-speech`) and Kimi-Audio-7B-Instruct (`kimiaudio`, `kimiaudio-speech`) models, and update the Audio Understanding Leaderboard
- [2025/05/08]
  - Faster resume evaluation: the `-r/--resume` parameter automatically searches for the latest breakpoint result if no file is specified
  - Support evaluation starting from an inference file: the `--infer-file` parameter allows direct evaluation from an inference file without regeneration
- [2025/03/23]
  - Added support for step-audio model evaluation and ranking
    - Ranking details: [leaderboard.md](assets/leaderboard.md)
    - Evaluation support: Step-Audio-Chat
- [2025/03/04]
  - Support [resume evaluation](docs/Procedures for Restarting an Incomplete Evaluation.md), command line parameter `--resume $checkpoint_res_file`
  - glm-4-voice service deployment, supports UltraEval-Audio evaluation, see details at GLM-4-Voice
  - Parallel evaluation support, command line parameter `--workers $num_workers`
- [2025/01/13] release v1.0
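
The resume and retry behaviour mentioned in the changelog can be pictured with a small sketch. It assumes a hypothetical JSON Lines result file with one record per finished sample; the function `run_with_resume`, the `id`/`output` fields, and the file layout are illustrative assumptions and do not describe the checkpoint format UltraEval-Audio actually uses.

```python
import json
import os
from typing import Callable, Iterable


def run_with_resume(
    samples: Iterable[dict],
    infer: Callable[[dict], object],
    result_path: str,
    max_retries: int = 2,
) -> int:
    """Run infer on each sample, skipping ids already stored in result_path.

    Returns the number of samples processed in this call.
    """
    # Collect ids finished by an earlier (possibly interrupted) run.
    done = set()
    if os.path.exists(result_path):
        with open(result_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["id"])

    processed = 0
    with open(result_path, "a", encoding="utf-8") as f:
        for sample in samples:
            if sample["id"] in done:
                continue
            for attempt in range(max_retries + 1):
                try:
                    output = infer(sample)
                    break
                except Exception:
                    # Re-raise once the retry budget is used up.
                    if attempt == max_retries:
                        raise
            # Append and flush immediately so a crash loses at most one sample.
            f.write(json.dumps({"id": sample["id"], "output": output}) + "\n")
            f.flush()
            processed += 1
    return processed
```

Calling the function again with the same `result_path` after a failure continues from the first unfinished sample instead of recomputing earlier outputs.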
Leaderboard
Audio…
Notability: 5.0/10. New audio eval repo, moderate traction.