Repo: OpenBMB (MiniCPM), published Sep 16, 2025

OpenBMB/VoxCPM

Python

Description: VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

Language: Python

License: Apache-2.0

Stars: 28282

Forks: 3199

Open issues: 121

Created: 2025-09-16T03:41:49Z

Pushed: 2026-06-10T07:23:13Z

Default branch: main

Fork: no

Archived: no

README: VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

👋 Join our community for discussion and support! Feishu | Discord

VoxCPM is a tokenizer-free Text-to-Speech system that directly generates continuous speech representations via an end-to-end diffusion autoregressive architecture, bypassing discrete tokenization to achieve highly natural and expressive synthesis.

VoxCPM2 is the latest major release: a 2B-parameter model trained on over 2 million hours of multilingual speech data, now supporting 30 languages, Voice Design, Controllable Voice Cloning, and 48kHz studio-quality audio output. It is built on a MiniCPM-4 backbone.

✨ Highlights

  • 🌍 30-Language Multilingual — Input text in any of the 30 supported languages and synthesize directly, no language tag needed
  • 🎨 Voice Design — Create a brand-new voice from a natural-language description alone (gender, age, tone, emotion, pace …), no reference audio required
  • 🎛️ Controllable Cloning — Clone any voice from a short reference clip, with optional style guidance to steer emotion, pace, and expression while preserving the original timbre
  • 🎙️ Ultimate Cloning — Reproduce every vocal nuance: provide both reference audio and its transcript, and the model continues seamlessly from the reference, faithfully preserving every vocal detail — timbre, rhythm, emotion, and style (same as VoxCPM1.5)
  • 🔊 48kHz High-Quality Audio — Accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio via AudioVAE V2's asymmetric encode/decode design, with built-in super-resolution — no external upsampler needed
  • 🧠 Context-Aware Synthesis — Automatically infers appropriate prosody and expressiveness from text content
  • Real-Time Streaming: RTF as low as ~0.3 on NVIDIA RTX 4090, and ~0.13 when accelerated by Nano-vLLM or vLLM-Omni (the official vLLM omni-modal serving for VoxCPM2, with PagedAttention and an OpenAI-compatible API)
  • 📜 Fully Open-Source & Commercial-Ready — Weights and code released under the [Apache-2.0](LICENSE) license, free for commercial use
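RTF (real-time factor) is conventionally the time spent generating audio divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below works through that arithmetic with the two figures quoted in the highlights. The function names and the 10-second clip length are illustrative assumptions, not part of VoxCPM.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_seconds_for(audio_seconds: float, rtf: float) -> float:
    """Expected generation time for a clip of the given length at a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    clip = 10.0  # seconds of speech to generate (illustrative value)
    # RTF figures as quoted in the highlights for an NVIDIA RTX 4090.
    for label, rtf in [("standard PyTorch", 0.3), ("Nano-vLLM / vLLM-Omni", 0.13)]:
        seconds = synthesis_seconds_for(clip, rtf)
        print(f"{label}: ~{seconds:.1f} s to generate {clip:.0f} s of audio")
```

At an RTF of ~0.3, a 10-second clip takes roughly 3 seconds to generate; at ~0.13, roughly 1.3 seconds.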

🌍 Supported Languages (30)

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

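Chinese Dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan dialect (河南话), Shaanxi dialect (陕西话), Shandong dialect (山东话), Tianjin dialect (天津话), Hokkien / Southern Min (闽南话)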

News

  • [2026.04] 🔥 We release VoxCPM2 — 2B, 30 languages, Voice Design & Controllable Voice Cloning, 48kHz audio output! Weights | Docs | Playground | Technical Report
  • [2025.12] 🎉 Open-source VoxCPM1.5 weights with SFT & LoRA fine-tuning. (🏆 #1 GitHub Trending)
  • [2025.09] 🔥 Release VoxCPM Technical Report.
  • [2025.09] 🎉 Open-source VoxCPM-0.5B weights (🏆 #1 HuggingFace Trending)

---

Contents

  • [Quick Start](#-quick-start)
  • [Installation](#installation)
  • [Python API](#python-api)
  • [CLI Usage](#cli-usage)
  • [Web Demo](#web-demo)
  • [Production Deployment](#-production-deployment-nano-vllm)
  • [Models & Versions](#-models--versions)
  • [Performance](#-performance)
  • [Fine-tuning](#%EF%B8%8F-fine-tuning)
  • [Documentation](#-documentation)
  • [Ecosystem & Community](#-ecosystem--community)
  • [Risks and Limitations](#%EF%B8%8F-risks-and-limitations)
  • [Citation](#-citation)

---

🚀 Quick Start

Installation

pip install voxcpm

> Requirements: Python ≥ 3.10 […]

> […] RTF as low as ~0.13 on NVIDIA RTX 4090 (vs ~0.3 with the standard PyTorch implementation), with support for batched concurrent requests and a FastAPI HTTP server. See the Nano-vLLM-VoxCPM repo for deployment details.

🏭 Production Serving (vLLM-Omni)

For production multi-tenant deployments, use [vLLM-Omni](https://github.com/vllm-project/vllm-omni), the official vLLM project's omni-modal extension with native VoxCPM2 support. It provides a PagedAttention KV cache, continuous batching, and a drop-in OpenAI-compatible /v1/audio/speech endpoint.

# Install from source (latest main — vllm-omni is rapidly evolving)
uv pip install vllm==0.19.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git && cd vllm-omni
uv pip install -e .

See the vLLM-Omni installation guide for other platforms (ROCm, XPU, MUSA, NPU) and Docker images.

# Launch an OpenAI-compatible TTS server (--omni enables omni-modal serving)
vllm serve openbmb/VoxCPM2 --omni --port 8000

# Call it from any OpenAI client
curl http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model":"openbmb/VoxCPM2","input":"Hello from VoxCPM2 on vLLM-Omni!","voice":"default"}' \
--output out.wav
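The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl call above: the URL, model name, and `voice` value are copied from that example, while the helper names and the 120-second timeout are illustrative assumptions.

```python
import json
import urllib.request

# Values taken from the curl example; adjust for your own deployment.
BASE_URL = "http://localhost:8000"
MODEL = "openbmb/VoxCPM2"


def build_speech_request(text: str, voice: str = "default",
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for the /v1/audio/speech endpoint."""
    payload = {"model": MODEL, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str = "out.wav") -> int:
    """Send the request and write the returned audio bytes to out_path.

    Returns the number of bytes written. The timeout is an arbitrary,
    illustrative value.
    """
    request = build_speech_request(text)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the `vllm serve` process running, `synthesize_to_file("Hello from VoxCPM2 on vLLM-Omni!")` writes the response body to `out.wav`, as the curl command does.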

> Built on the upstream vLLM scheduler, with batched concurrent requests, streaming chunk delivery, and multi-GPU deployment out of the box. See the VoxCPM2 example for full deployment recipes.
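To check that a returned file really is 48kHz, the WAV header can be read with Python's standard `wave` module. This sketch assumes the endpoint returns PCM WAV data, which the source does not state explicitly. `make_silence` is a hypothetical stand-in for a server response so the example runs offline.

```python
import io
import wave


def wav_info(data: bytes) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for PCM WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def make_silence(seconds: float, rate: int = 48_000) -> bytes:
    """Build a silent 16-bit mono WAV in memory (stand-in for real output)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    rate, channels, duration = wav_info(make_silence(1.5))
    print(rate, channels, duration)  # → 48000 1 1.5
```

For a real response, pass the bytes of `out.wav` to `wav_info` (for example `wav_info(open("out.wav", "rb").read())`) and compare the first value with 48000.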

> Full parameter reference, multi-scenario examples, and voice cloning tips → Quick Start Guide | Usage Guide | Cookbook

---

📦 Models & Versions

| | VoxCPM2 | VoxCPM1.5 | VoxCPM-0.5B |
| ------------------------------- |…
