Run MiniMax Speech-02 models with an API
Replicate Blog
Posted May 6, 2025 by fofr
The Speech-02 series from MiniMax is a family of text-to-speech models that let you create natural-sounding voices with emotional expression. The models support more than 30 languages.
According to the Artificial Analysis Speech Arena, Speech-02-HD is the best text-to-speech model available today, while Speech-02-Turbo comes in third.
With Replicate, you can run these models with one line of code.
Listen to MiniMax Speech-02
Here’s a sample of the Speech-02-HD model reading an adapted version of this blog post, and the prediction that generated it.
Listen to this blog post
MiniMax Speech-02 models are the best text-to-speech models available today.
Try MiniMax Speech-02
You can choose between two models: Speech-02-HD for high-quality voiceovers and audiobooks, and Speech-02-Turbo, a cheaper model that’s faster and best suited for real-time applications.
Both models can be used with a cloned voice. Voice cloning needs at least 10 seconds of audio and takes about 30 seconds to train. Each voice can be adjusted for pitch, speed, and volume to make it sound natural.
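Before uploading a clip for cloning, it can help to check it locally. The following is a minimal sketch, not part of the Replicate client: it checks a WAV file against the 10-second minimum mentioned above and the 5-minute and 20MB ceilings this post gives in the API walkthrough. The function name `check_clone_audio` is hypothetical, it only handles WAV files (via Python's standard `wave` module), and it treats "20MB" as 20 MiB, which is an assumption.

```python
import os
import wave

MIN_SECONDS = 10               # minimum clip length stated in the post
MAX_SECONDS = 5 * 60           # 5-minute ceiling stated in the post
MAX_BYTES = 20 * 1024 * 1024   # "20MB", read here as 20 MiB (assumption)


def check_clone_audio(path):
    """Return a list of problems with a WAV file; an empty list means it looks usable."""
    problems = []

    # The post says the file should be less than 20MB.
    if os.path.getsize(path) >= MAX_BYTES:
        problems.append("file is 20MB or larger")

    # Duration in seconds = number of frames / frames per second.
    with wave.open(path, "rb") as wav:
        seconds = wav.getnframes() / wav.getframerate()

    if seconds < MIN_SECONDS:
        problems.append(f"clip is {seconds:.1f}s, shorter than {MIN_SECONDS}s")
    elif seconds > MAX_SECONDS:
        problems.append(f"clip is {seconds:.1f}s, longer than {MAX_SECONDS}s")

    return problems
```

MP3 and M4A files would need a different way to read the duration, since `wave` only understands WAV.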
Try the models in our playground:
Speech-02-HD - For high-quality voiceovers and audiobooks
Speech-02-Turbo - For real-time applications
Voice Cloning - For creating custom voices
What you can build
These models can help you create:
Virtual assistants that sound natural
Audiobooks and voiceovers with studio-quality sound
Language learning tools with native pronunciation
Customer service bots that speak multiple languages
Content that’s accessible to people who prefer audio
Emotion control
MiniMax’s emotion control system has two ways to add feeling to voices. The auto-detect mode figures out the emotional tone from your text, while manual controls let you set the exact emotion you want. This helps your voices sound natural and engaging, whether you’re making content for entertainment, education, or business.
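The two modes can be sketched as one small helper that builds the model input. This is only an illustration: `build_tts_input` is a hypothetical name, the field names `text`, `voice_id`, and `emotion` are taken from the examples later in this post, and the idea that leaving `emotion` out hands the choice to auto-detection is an assumption rather than documented behavior.

```python
def build_tts_input(text, voice_id, emotion=None):
    """Build an input dict for a Speech-02 prediction.

    emotion=None -> no "emotion" key is sent (assumed to leave the tone
                    to the model's auto-detect mode).
    emotion="sad" -> the emotion is set manually.
    """
    if not text.strip():
        raise ValueError("text must not be empty")

    payload = {"text": text, "voice_id": voice_id}
    if emotion is not None:
        # Manual control: pass the exact emotion you want.
        payload["emotion"] = emotion
    return payload
```

The resulting dict could then be passed as the `input` argument to `replicate.run("minimax/speech-02-turbo", ...)`.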
Language support
The models work with more than 30 languages and accents. You can use different English variants (US, UK, Australian, and Indian), Asian languages (Mandarin, Cantonese, Japanese, Korean, Vietnamese, and Indonesian), and European languages (French, German, Spanish, Portuguese, Turkish, Russian, and Ukrainian).
Voice cloning and text-to-speech with JavaScript
You can run the models with our JavaScript client. First, install the Node.js client library:
npm install replicate
Set your API token as an environment variable:
export REPLICATE_API_TOKEN=r8_9wm**********************************
(You can get an API token from your account. Keep it private.)
Import and set up the client:
import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});
First, clone a voice. You’ll need an audio file in MP3, M4A, or WAV format. The file should be between 10 seconds and 5 minutes long and less than 20MB in size:
const cloneOutput = await replicate.run("minimax/voice-cloning", {
  input: {
    voice_file: "path/to/your/audio.wav", // mp3, wav, or m4a
    model: "speech-02-turbo", // speech-02-hd or speech-02-turbo
  },
});

const voiceId = cloneOutput.voice_id;
console.log("Cloned voice ID:", voiceId);
Now use the cloned voice for text-to-speech. You can add pauses between words using <#x#>, where x is the pause duration in seconds (0.01-99.99):
const input = {
  text: "Hello! This is a test using my cloned voice. I can add pauses between words to make the speech sound more natural.",
  voice_id: voiceId, // Use the cloned voice ID
  emotion: "happy", // Optional: happy, sad, angry, etc.
};

const output = await replicate.run("minimax/speech-02-turbo", { input });
console.log(output);
Voice cloning and text-to-speech with Python
You can run the models with our Python client. First, install the client and set your API token:
pip install replicate
export REPLICATE_API_TOKEN=r8_9wm**********************************
Here’s how to clone a voice and use it for text-to-speech:
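import replicate

# Clone a voice (needs MP3, M4A, or WAV file, 10s-5min, <20MB)
clone_output = replicate.run(
    "minimax/voice-cloning",
    input={
        "voice_file": "path/to/your/audio.wav",  # mp3, wav, or m4a
        "model": "speech-02-turbo",  # speech-02-hd or speech-02-turbo
    },
)

# Use the cloned voice for text-to-speech.
# Add pauses with <#x#>, where x is the pause duration in seconds (0.01-99.99)
output = replicate.run(
    "minimax/speech-02-turbo",
    input={
        "text": "Hello! This is a test using my cloned voice. I can add pauses between words to make the speech sound more natural.",
        "voice_id": clone_output["voice_id"],
        "emotion": "happy",
    },
)
print(output)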
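If you want pauses at specific points, a small helper can build the markers for you. The sketch below assumes the pause marker has the form `<#x#>`, with x in seconds between 0.01 and 99.99; the helper names `pause` and `join_with_pauses` are hypothetical and not part of the Replicate client.

```python
def pause(seconds):
    """Return a pause marker such as <#0.50#> for the given duration in seconds."""
    # The allowed range of 0.01-99.99 seconds comes from the post.
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    return f"<#{seconds:.2f}#>"


def join_with_pauses(sentences, seconds):
    """Join sentences into one string with the same pause between each pair."""
    return f" {pause(seconds)} ".join(sentences)


text = join_with_pauses(["Hello!", "This is a test using my cloned voice."], 0.5)
print(text)  # Hello! <#0.50#> This is a test using my cloned voice.
```

The resulting string can be used as the `"text"` value in the model input.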
Pricing
The text-to-speech models charge based on input and output tokens, where one token is a single character. The Turbo model costs $30 per million characters, while the HD model costs $50 per million characters.
Voice cloning costs $3 per voice.
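As a rough worked example of these prices, the sketch below estimates the cost of a job from the length of the text. It is an approximation under stated assumptions: it counts only the characters of the input text using Python's `len`, and the function name `estimate_cost` and the dictionary keys are illustrative rather than anything the API defines.

```python
# Prices from this post, in US dollars.
PRICE_PER_MILLION_CHARS = {
    "speech-02-turbo": 30.0,
    "speech-02-hd": 50.0,
}
VOICE_CLONE_PRICE = 3.0  # per cloned voice


def estimate_cost(text, model="speech-02-turbo", cloned_voices=0):
    """Estimate the dollar cost of speaking `text`, plus any voices cloned."""
    rate = PRICE_PER_MILLION_CHARS[model]
    # One token is one character, so cost scales with the character count.
    speech_cost = len(text) * rate / 1_000_000
    return speech_cost + cloned_voices * VOICE_CLONE_PRICE


# A 2,500-character script on the HD model: 2500 * 50 / 1,000,000 = $0.125
print(estimate_cost("a" * 2500, model="speech-02-hd"))  # 0.125
```

At these rates, a single cloned voice costs as much as 100,000 characters of Turbo speech.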
Keep up to speed
Connect with our community by following us on X and joining our Discord for updates and discussions.
Happy hacking! 🎙️