What does this model signal mean?

Tencent Hunyuan published tencent/Unified_Audio_Schema. This model signal records what shipped and how the release is positioned. High-signal details: license other · 16 HF downloads · Tencent's unified framework for audio representation and processing tasks. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Tencent Hunyuan Model: tencent/Unified_Audio_Schema

Captured source

Hugging Face/huggingface.co/tencent/Unified_Audio_Schema

tencent/Unified_Audio_Schema model card

published Apr 3, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task audio-text-to-text · license other · params 8.3B · downloads 16 · likes 11

Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

Unified Audio Schema is a novel holistic framework for audio supervision that disentangles and restructures supervision across transcription, paralinguistics, and non-linguistic events.

📄 Paper | 💻 GitHub

This repository provides our model checkpoints trained using Unified Audio Schema. For the complete codebase, please refer to the corresponding GitHub repository.

Model Details

| Attribute | Value |
|:----------|:------|
| Input Modality | Text and audio |
| Output Modality | Text and audio |
| Base LLM | Qwen2.5-7B |
| Audio Encoder | AuT encoder |
| Input Audio Representation Frame Rate | 12.5 Hz |
| Output Audio Token Codebook Size | 8,192 |
| Output Audio Token Frame Rate | 25 Hz |

Notes:

- The model supports interleaved text and audio input/output, enabling flexible multimodal interactions.
- Speech waveform reconstruction for generated audio tokens relies on the StableToken decoder.
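The frame rates and codebook size in the Model Details table imply some simple arithmetic that is useful when budgeting tokens. The sketch below assumes one output audio token per frame drawn from a single 8,192-entry codebook, which the table suggests but does not state outright. The function names are illustrative and are not part of the repository.

```python
import math

# Values taken from the Model Details table.
INPUT_FRAME_RATE_HZ = 12.5   # input audio representation frames per second
OUTPUT_FRAME_RATE_HZ = 25.0  # output audio tokens per second
CODEBOOK_SIZE = 8192         # output audio token codebook size


def bits_per_token(codebook_size: int) -> float:
    """Information carried by one token from a codebook of this size."""
    return math.log2(codebook_size)


def token_bitrate(frame_rate_hz: float, codebook_size: int) -> float:
    """Bits per second of a token stream (assumes one token per frame)."""
    return frame_rate_hz * bits_per_token(codebook_size)


def tokens_to_seconds(num_tokens: int, frame_rate_hz: float = OUTPUT_FRAME_RATE_HZ) -> float:
    """Duration of audio represented by a number of output audio tokens."""
    return num_tokens / frame_rate_hz


def input_frames(seconds: float, frame_rate_hz: float = INPUT_FRAME_RATE_HZ) -> float:
    """Number of input representation frames for a clip of this length."""
    return seconds * frame_rate_hz
```

Under these assumptions each output token carries 13 bits, the output stream runs at 325 bits per second, and a 10-second input clip corresponds to 125 input frames.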

Quick Start

Installation

git clone --recursive https://github.com/Tencent/Unified_Audio_Schema.git
cd Unified_Audio_Schema && pip install -r requirements.txt

Download Checkpoints

# Model weights
huggingface-cli download tencent/Unified_Audio_Schema --local-dir checkpoints/Unified_Audio_Schema

# StableToken decoder (required for speech waveform reconstruction)
huggingface-cli download tencent/StableToken --local-dir checkpoints/StableToken
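The same two downloads can also be scripted from Python. This is a minimal sketch that uses `snapshot_download` from the `huggingface_hub` package with the repository IDs and target directories from the commands above; the `CHECKPOINTS` mapping and the `download_checkpoints` helper are illustrative names, not part of the repository.

```python
# Repository IDs and local directories mirror the CLI commands above.
CHECKPOINTS = {
    "tencent/Unified_Audio_Schema": "checkpoints/Unified_Audio_Schema",
    "tencent/StableToken": "checkpoints/StableToken",
}


def download_checkpoints(checkpoints=CHECKPOINTS):
    """Download each repository into its local directory and return the paths."""
    # Imported lazily so the mapping can be inspected without the package installed.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id, local_dir in checkpoints.items():
        paths.append(snapshot_download(repo_id=repo_id, local_dir=local_dir))
    return paths
```

Calling `download_checkpoints()` requires network access and the `huggingface_hub` package.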

Inference

import torch
import torchaudio
from src.model import UASAudio

# Load the model weights and the StableToken decoder downloaded above.
model = UASAudio(
    model_path="checkpoints/Unified_Audio_Schema",
    audio_decoder_path="checkpoints/StableToken/decoder",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# System prompt requesting an interleaved response:
# 13 text tokens, then 52 audio tokens per chunk.
dialogue_system_prompt = (
    "User will provide you with a speech instruction. Do it step by step. "
    "First, think about the instruction and respond in a interleaved manner, "
    "with 13 text token followed by 52 audio tokens."
)

# The user turn carries a spoken instruction; the assistant turn is left empty
# for the model to fill in.
messages = [
    {"role": "system", "content": dialogue_system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "assets/give_me_a_brief_introduction_to_the_great_wall.wav"},
        ],
    },
    {"role": "assistant", "content": None},
]

# Sampling settings passed to the model call as keyword arguments.
generation_config = {
    "max_new_tokens": 4096,
    "temperature": 0.7,
    "repetition_penalty": 1.05,
    "top_p": 0.9,
    "do_sample": True,
}

# The call returns three values; the first is unused here.
_, text, audio_tokens = model(messages, **generation_config)
print(text)

# Decode audio tokens to a waveform only if the model produced any.
if len(audio_tokens) > 0:
    audio_array, sampling_rate = model.tokens_to_audio(audio_tokens)
    torchaudio.save("response.wav", audio_array, sampling_rate)
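The system prompt asks for chunks of 13 text tokens followed by 52 audio tokens. At the 25 Hz output frame rate listed under Model Details, 52 audio tokens correspond to roughly 2.08 seconds of speech per chunk. The sketch below illustrates that chunking pattern on plain Python lists. It is not the model's decoding code, `interleave_chunks` is a hypothetical helper, and the source does not say how a trailing partial chunk is handled, so the sketch simply emits whatever remains.

```python
from typing import List, Sequence, Tuple

TEXT_CHUNK = 13   # text tokens per chunk, from the system prompt
AUDIO_CHUNK = 52  # audio tokens per chunk, from the system prompt


def interleave_chunks(
    text_tokens: Sequence,
    audio_tokens: Sequence,
    text_chunk: int = TEXT_CHUNK,
    audio_chunk: int = AUDIO_CHUNK,
) -> List[Tuple[str, list]]:
    """Alternate fixed-size text and audio chunks until both streams are used up."""
    segments = []
    t = a = 0
    while t < len(text_tokens) or a < len(audio_tokens):
        if t < len(text_tokens):
            # Slicing past the end yields a shorter final chunk.
            segments.append(("text", list(text_tokens[t:t + text_chunk])))
            t += text_chunk
        if a < len(audio_tokens):
            segments.append(("audio", list(audio_tokens[a:a + audio_chunk])))
            a += audio_chunk
    return segments
```

For example, 30 text tokens and 100 audio tokens produce five segments with lengths 13, 52, 13, 48, and 4, alternating text and audio.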

Supported Scenarios

Our model can be applied to a wide range of audio understanding and generation tasks, including:

- Text-input conversation
- Speech-input conversation
- Automatic Speech Recognition (ASR)
- Audio captioning
- Text-to-Speech (TTS)

For more runnable examples, please refer to `example_usage.ipynb` in the GitHub repository.

Evaluation Highlights

UAS-Audio has the highest average score among the compared models on the audio understanding benchmarks (65.2) and is competitive on the ASR and TTS benchmarks.

Audio Understanding

| Model | MMSU (Percep.) | MMSU (Reason.) | MMSU (Overall) | MMAR (Speech) | MMAR (Sound) | MMAR (Music) | MMAR (Overall) | MMAU (Speech) | MMAU (Sound) | MMAU (Music) | MMAU (Overall) | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Kimi-Audio | 44.8 | 75.7 | 59.8 | 58.5 | 49.7 | 33.0 | 48.0 | 62.2 | 75.7 | 66.8 | 68.2 | 58.7 |
| Qwen2.5-Omni | 42.7 | 77.6 | 58.1 | 59.9 | 58.8 | 40.8 | 56.7 | 70.6 | 78.1 | 65.9 | 71.5 | 62.1 |
| Step-Audio2 | 42.9 | 73.2 | 57.6 | 61.2 | 54.6 | 42.2 | 56.8 | 68.2 | 79.3 | 68.4 | 72.7 | 61.9 |
| Ours | 55.7 | 77.4 | 66.2 | 66.0 | 58.8 | 45.2 | 60.1 | 67.0 | 70.0 | 71.3 | 69.4 | 65.2 |
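The model card does not say how the Avg. column is computed. One plausible reading, offered here only as an assumption, is the unweighted mean of the three Overall columns (MMSU, MMAR, MMAU). The sketch below checks that reading against the reported values; the names `ROWS`, `mean_overall`, and `compare_rows` are illustrative.

```python
# Overall scores copied from the table as (MMSU, MMAR, MMAU), plus the reported Avg.
ROWS = {
    "Kimi-Audio": ((59.8, 48.0, 68.2), 58.7),
    "Qwen2.5-Omni": ((58.1, 56.7, 71.5), 62.1),
    "Step-Audio2": ((57.6, 56.8, 72.7), 61.9),
    "Ours": ((66.2, 60.1, 69.4), 65.2),
}


def mean_overall(scores):
    """Unweighted mean of the Overall scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


def compare_rows(rows=ROWS):
    """Map each model to (recomputed mean, reported Avg.)."""
    return {
        name: (mean_overall(scores), reported)
        for name, (scores, reported) in rows.items()
    }
```

Under this assumption the recomputed means match the reported Avg. for Kimi-Audio, Qwen2.5-Omni, and Ours. The Step-Audio2 row recomputes to 62.4 against a reported 61.9, and the source does not explain the difference.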

ASR & TTS

| Model | ASR (LS-clean) | ASR (AISHELL-1) | TTS (SeedTTS-en) | TTS (SeedTTS-zh) |
| :--- | :---: | :---: | :---: | :---: |
| Qwen2.5-Omni | - | - | 2.3 | 1.4 |
| Step-Audio2 | 1.9 | 1.0 | 2.1 | 3.2 |
| MiMo-Audio | 3.8 | 1.8 | 5.4 | 2.0 |
| Ours | 2.2 | 2.3 | 1.7 | 1.4 |

Citation

If you find Unified Audio Schema or our model useful for your research, please cite:

@misc{zhang2026transcriptionunifiedaudioschema,
  title={Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs},
  author={Linhao Zhang and Yuhan Song and Aiwei Liu and Chuhan Wu and Sijun Zhang and Wei Jia and Yuan Liu and Houfeng Wang and Xiao Zhou},
  year={2026},
  eprint={2604.12506},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.12506},
}

@inproceedings{song2026stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient Speech{LLM}s},
  author={Yuhan Song and Linhao Zhang and Chuhan Wu and Aiwei Liu and Wei Jia and Houfeng Wang and Zhou Xiao},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=17DNmdQ9aU}
}

License

This project is licensed under the [License Term of Unified_Audio_Schema](LICENSE).

Notability

notability 3.0/10

Very low traction repo/model release