zai-org/GLM-TTS
Python
Description: GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
Language: Python
License: Apache-2.0
Stars: 1020
Forks: 129
Open issues: 45
Created: 2025-12-06T04:50:56Z
Pushed: 2026-04-10T08:50:23Z
Default branch: main
Fork: no
Archived: no
README:
GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning
[Read in Chinese](README_zh.md)
📜 Paper | 🤗 HuggingFace | 🤖 ModelScope | 🛠️Audio.Z.AI
Model Introduction
GLM-TTS is a high-quality text-to-speech (TTS) synthesis system built on large language models, supporting zero-shot voice cloning and streaming inference. It uses a two-stage architecture: an LLM first generates a sequence of speech tokens, and a Flow model then converts those tokens into high-quality audio waveforms. A Multi-Reward Reinforcement Learning framework lets GLM-TTS generate more expressive and emotional speech than traditional TTS systems.
News & Updates
- [2025.12.11] 🎉 The project is officially open-sourced, featuring inference scripts and a series of model weights.
- [2025.12.17] GLM-TTS Technical Report is available on arXiv: 2512.14291.
- [Coming Soon] 2D Vocos vocoder update in progress.
- [Coming Soon] Model Weights Optimized via Reinforcement Learning
Features
- Zero-shot Voice Cloning: Clone any speaker's voice with just 3-10 seconds of prompt audio
- RL-enhanced Emotion Control: Achieve more natural emotional expression and prosody control through a multi-reward reinforcement learning framework
- Streaming Inference: Support real-time streaming audio generation, suitable for interactive applications
- High-quality Synthesis: Generate natural and expressive speech with quality comparable to commercial systems
- Multi-language Support: Primarily supports Chinese, and also handles mixed Chinese-English text
- Phoneme-level Modeling: Support phoneme-level text-to-speech conversion
- Flexible Inference Methods: Support multiple sampling strategies and inference modes
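The 3-10 second prompt window from the feature list is easy to check before attempting a clone. The following is a minimal sketch using only the standard library. The helper names `wav_duration_seconds` and `prompt_duration_ok` are hypothetical, and the sketch assumes the prompt is an uncompressed PCM WAV file; neither is part of GLM-TTS.

```python
import wave

MIN_PROMPT_SECONDS = 3.0   # lower bound quoted in the feature list
MAX_PROMPT_SECONDS = 10.0  # upper bound quoted in the feature list


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def prompt_duration_ok(path: str) -> bool:
    """Hypothetical helper: True if the prompt lies within the 3-10 s window."""
    duration = wav_duration_seconds(path)
    return MIN_PROMPT_SECONDS <= duration <= MAX_PROMPT_SECONDS
```

Other container formats (MP3, FLAC) would need an audio library instead of `wave`.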
Quick Start
Environment Setup
Use Python 3.10 through 3.12.
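A quick pre-flight check can catch an unsupported interpreter before any dependencies are installed. This is a minimal sketch; `python_version_supported` is a hypothetical helper, not something shipped with the repository, and it only encodes the version range stated above.

```python
import sys

SUPPORTED_MIN = (3, 10)  # lowest version named in the README
SUPPORTED_MAX = (3, 12)  # highest version named in the README


def python_version_supported(version=None) -> bool:
    """Return True if (major, minor) lies within 3.10 to 3.12 inclusive.

    `version` defaults to the running interpreter; pass a tuple such as
    (3, 11) to check a different one.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX
```

Calling `python_version_supported()` with no argument checks the interpreter that is currently running.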
For GPU
# Clone repository
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS

# Install dependencies
pip install -r requirements.txt

# Install reinforcement learning related dependencies (optional)
cd grpo/modules
git clone https://github.com/s3prl/s3prl
git clone https://github.com/omine-me/LaughterSegmentation

# Download wavlm_large_finetune.pth and place it in grpo/ckpt directory
For NPU
Obtain CANN image
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/cann:8.5.1-910b-ubuntu22.04-py3.11
docker run --rm \
    --name vllm-ascend-env \
    --shm-size=1g \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
# Clone repository
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/"
python -m pip install -r requirements_npu.txt --no-build-isolation

# Install reinforcement learning related dependencies (optional)
cd grpo/modules
git clone https://github.com/s3prl/s3prl
git clone https://github.com/omine-me/LaughterSegmentation

# Download wavlm_large_finetune.pth and place it in grpo/ckpt directory
Download Pre-trained Models
We support downloading the complete model weights (including Tokenizer, LLM, Flow, Vocoder, and Frontend) from HuggingFace or ModelScope.
# Create model directory
mkdir -p ckpt

# Option 1: Download from HuggingFace
pip install -U huggingface_hub
huggingface-cli download zai-org/GLM-TTS --local-dir ckpt

# Option 2: Download from ModelScope
pip install -U modelscope
modelscope download --model ZhipuAI/GLM-TTS --local_dir ckpt
Running Inference Demo
Command Line Inference
python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache
# Add --phoneme to the command above to enable phoneme capabilities.
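When the inference script is driven from another Python program, the same command can be assembled as an argument list. The sketch below only uses the flags shown above (`--data`, `--exp_name`, `--use_cache`, `--phoneme`); `build_inference_cmd` is a hypothetical wrapper, and any other options the script may accept are not covered here.

```python
import sys


def build_inference_cmd(data="example_zh", exp_name="_test",
                        use_cache=True, phoneme=False):
    """Assemble the argv list for glmtts_inference.py from the README flags."""
    cmd = [
        sys.executable,            # run with the current interpreter
        "glmtts_inference.py",
        f"--data={data}",
        f"--exp_name={exp_name}",
    ]
    if use_cache:
        cmd.append("--use_cache")
    if phoneme:
        cmd.append("--phoneme")    # enables phoneme capabilities
    return cmd
```

From the repository root, `subprocess.run(build_inference_cmd(phoneme=True), check=True)` would then launch the run with phoneme support enabled.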
Shell Script Inference
bash glmtts_inference.sh
Interactive Web Interface
python -m tools.gradio_app
System Architecture
Overview
GLM-TTS adopts a two-stage design. In the first stage, a large language model (LLM) based on the Llama architecture converts input text into speech token sequences. In the second stage, a Flow Matching model converts these token sequences into high-quality mel-spectrograms, which a vocoder then turns into audio waveforms. The system supports zero-shot voice cloning by extracting speaker features from the prompt audio, with no fine-tuning for specific speakers.
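The data flow described above can be sketched as three callables wired in sequence. This is an illustration of the stage boundaries only: the class `TwoStageTTS`, the `toy_*` stand-ins, and the choice to pass the prompt features to both the LLM and the Flow stage are assumptions made for the example, not the GLM-TTS API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

SpeechTokens = List[int]
Mel = List[List[float]]   # frames x mel bins
Waveform = List[float]


@dataclass
class TwoStageTTS:
    """Hypothetical wiring of the stages; each stage is an injected callable."""
    llm: Callable[[str, Sequence[float]], SpeechTokens]    # stage 1: text -> speech tokens
    flow: Callable[[SpeechTokens, Sequence[float]], Mel]   # stage 2: tokens -> mel-spectrogram
    vocoder: Callable[[Mel], Waveform]                     # mel-spectrogram -> waveform

    def synthesize(self, text: str, prompt_features: Sequence[float]) -> Waveform:
        tokens = self.llm(text, prompt_features)
        mel = self.flow(tokens, prompt_features)
        return self.vocoder(mel)


# Toy stand-ins so the sketch runs end to end; real stages are neural models.
def toy_llm(text, prompt_features):
    return [ord(ch) % 256 for ch in text]        # one fake token per character


def toy_flow(tokens, prompt_features):
    return [[t / 255.0] * 4 for t in tokens]     # one fake 4-bin frame per token


def toy_vocoder(mel):
    # Two fake samples per frame, each the mean of the frame's bins.
    return [sum(frame) / len(frame) for frame in mel for _ in range(2)]
```

Keeping the stages as separate callables mirrors the description: the token sequence is the only thing passed from the first stage to the second, and the vocoder sees only mel-spectrogram frames.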
Fine-grained Pronunciation Control (Phoneme-in)
For scenarios demanding high pronunciation accuracy, such as educational assessments and audiobooks, GLM-TTS introduces the Phoneme-in mechanism to resolve pronunciation ambiguity in polyphonic characters (e.g., "行", which can be read as *xíng* or *háng*) and rare characters. This mechanism supports "Hybrid Phoneme + Text" input, enabling precise, targeted control over the pronunciation of specific words.
- Hybrid Training
During training, random G2P (Grapheme-to-Phoneme) conversion is applied to parts of the text. This strategy compels the model to adapt to hybrid input sequences, preserving its ability to understand pure text while enhancing generalization for phoneme inputs.
- Targeted Inference
Inference follows a G2P -> Table Lookup Replacement -> Hybrid Input workflow:
1. Global Conversion: Obtain the complete phoneme sequence for the input text.
2. Dynamic Replacement: Using a "Dynamic Controllable Dictionary," automatically identify polyphones or rare characters and replace them with specified target phonemes.
3. Hybrid Generation: Feed the combination of replaced phonemes and original text into GLM-TTS as a hybrid input. This ensures…
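The Phoneme-in steps above can be illustrated with a toy example. Everything concrete here is an assumption made for the sketch: the three-entry `TOY_G2P` table, the tone-number pinyin, the `<...>` markup for phonemes, and all function names. The actual G2P front end, dictionary format, and hybrid input encoding used by GLM-TTS are not shown in this excerpt.

```python
import random

# Toy per-character default readings; a real G2P front end is context-aware.
TOY_G2P = {"银": "yin2", "行": "xing2", "不": "bu4"}


def g2p(text):
    """Step 1 (global conversion): default phoneme per character, None if unknown."""
    return [TOY_G2P.get(ch) for ch in text]


def apply_overrides(text, phonemes, overrides):
    """Step 2 (dynamic replacement): overwrite phonemes for every dictionary hit.

    `overrides` maps a word to one target phoneme per character of that word.
    Returns the corrected sequence and the set of replaced positions.
    """
    corrected, replaced = list(phonemes), set()
    for word, targets in overrides.items():
        start = text.find(word)
        while start != -1:
            for offset, target in enumerate(targets):
                corrected[start + offset] = target
                replaced.add(start + offset)
            start = text.find(word, start + len(word))
    return corrected, replaced


def hybrid_input(text, overrides):
    """Step 3 (hybrid generation): phonemes at replaced positions, text elsewhere."""
    corrected, replaced = apply_overrides(text, g2p(text), overrides)
    return "".join(f"<{corrected[i]}>" if i in replaced else ch
                   for i, ch in enumerate(text))


def random_hybridize(text, p, rng):
    """Training-side sketch: swap each known character for its phoneme with probability p."""
    return "".join(f"<{TOY_G2P[ch]}>" if ch in TOY_G2P and rng.random() < p else ch
                   for ch in text)
```

With this toy table, `hybrid_input("银行不行", {"银行": ["yin2", "hang2"]})` returns `<yin2><hang2>不行`: the 行 in 银行 is pinned to *háng*, while the final 行 stays as plain text for the model to read. `random_hybridize(text, 0.3, random.Random(0))` shows the training-side idea of converting a random subset of characters.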
Excerpt shown — open the source for the full document.
Notability
notability 7.0/10. New TTS model with strong stars.