Speaking of Voxtral
Mistral AI Research | March 23, 2026 | By Mistral AI
Today we’re releasing Voxtral TTS, our first text-to-speech model with state-of-the-art performance in multilingual voice generation. The model is lightweight at 4B parameters, making Voxtral-powered agents natural, reliable, and cost-effective at scale.
Highlights.
Realistic, emotionally expressive speech in 9 popular languages with support for diverse dialects.
Very low latency for time-to-first-audio.
Easily adaptable to new voices.
Available to test out in Mistral Studio.
Enterprise-grade text-to-speech, powering critical voice agent workflows.
Natural voice generation hinges on the model’s ability not only to recite a text but to interpret it accurately. Contextual understanding (whether a line should sound neutral, happy, or sarcastic, for example) determines whether the listener considers the generation accurate or robotic. Our model excels at both contextual understanding and speaker modeling: capturing how a specific person naturally speaks. Our voice adaptation goes beyond traditional read speech by capturing a speaker’s personality, including their natural pauses, rhythm, intonation, and emotional dexterity. With its compact size, low cost and latency, and easy adaptability, Voxtral TTS gives full control and customization to enterprises looking to own their voice AI stack.
Audio is the new UX.
Create new interactions for collaboration and understanding only found in speech. Begin now in AI Studio with our Mistral Voices in American, British, and French dialects.
Listen and decide: can you tell the difference?
Our team speaks dozens of languages in multiple dialects. We understand the importance of cultural nuance and built a model that is a reflection of us. Speech generation builds trust through natural rhythm, emotion, and even the use of humor. That’s why, with voice emulation, we focused on authenticity and emotional expressiveness.
[Interactive demo: original voice and emulation for speakers Margaret, Sanchit, Angele, Gustavo, Khyathi, Nick, Yassir, and Patrick.]
Demo prompt: "Boy oh boy! I'm so excited for the summer. It's going to be so warm here, can't wait for swimming in the Lido and making cherry pie."
Win rate (voice customization, average listener preference): Voxtral TTS 68.4%, ElevenLabs 31.6%.
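As a rough illustration of how a side-by-side win rate like the one above could be tallied, here is a minimal sketch. The post does not describe the aggregation procedure, so the vote labels, the `win_rate` helper, and the choice to exclude ties are all assumptions made for the example, and the sample votes are made up.

```python
from collections import Counter

def win_rate(votes, system):
    """Share of decisive (non-tie) votes that prefer `system`.

    Assumption: ties are dropped before computing the share. The post
    does not say how ties, if any, were handled.
    """
    counts = Counter(v for v in votes if v != "tie")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no decisive votes to aggregate")
    return counts[system] / total

# Made-up votes for illustration only, not data from the evaluation.
votes = ["voxtral", "voxtral", "elevenlabs", "tie", "voxtral"]
print(f"{win_rate(votes, 'voxtral'):.1%}")  # 75.0%
```

With only two systems and ties excluded, the two win rates sum to 100%, which matches the shape of the reported 68.4% / 31.6% split.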
State-of-the-art performance.
Automated metrics for multilingual text-to-speech systems, such as word error rate and audio quality scores, cannot measure the naturalness of speech. What makes speech natural is extremely nuanced and requires a deep understanding of cultural differences and typical speaking patterns. Hence, comparative human evaluations performed by native speakers are crucial.
For voice agents, latency and quality are in constant tension. Human evaluations show that Voxtral TTS achieves superior naturalness compared to ElevenLabs Flash v2.5 while maintaining similar Time-to-First-Audio (TTFA). Voxtral also performs at parity with the quality of ElevenLabs v3, successfully supporting emotion-steering for more lifelike interactions.
We conducted a comparative human evaluation of Voxtral TTS and ElevenLabs Flash v2.5 in a zero-shot custom voice context. Using two recognizable voices in their native dialects for each of the 9 supported languages, 3 annotators performed a side-by-side preference test per pair on naturalness, accent adherence, and acoustic similarity to the original reference. Voxtral TTS widens the quality gap over Flash v2.5 in this zero-shot multilingual custom voice setting, highlighting the instant customizability of Voxtral TTS to any voice.
Spoken natively.
Trained on a large speech dataset, Voxtral TTS is built for global application. It delivers state-of-the-art performance in 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The model was trained to adapt to a custom voice from a reference as short as 3 seconds and to capture not just the voice but also nuances such as subtle accents, inflections, intonations, and even disfluencies similar to those in the reference. We offer some preset voice options in the API, but it is simple to extend them with your in-house voice library: customize a voice to the use case, localize it to the language and accent, and keep it neutral or more emotive, casual or formal, natural and conversational or robotic.
The model also demonstrates zero-shot cross-lingual voice adaptation even though it was not explicitly trained for it. For example, given a French voice prompt and English text, the model generates English speech. The resulting speech sounds natural while adopting the accent of the voice prompt (here, natural French-accented English). This makes the model useful for building cascaded speech-to-speech translation systems.
[Interactive demo: cascaded speech-to-speech translation with speakers Paul (English, US), Marie (French), and Oliver (English, UK), into English, French, Spanish, or German. Prompt: "Before we begin, I'll need to verify a few details. Can you confirm your full name and date of birth?"]
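To make the idea of a cascaded speech-to-speech translation system concrete, here is a minimal structural sketch. It only shows how the three stages chain together and how the speaker's own audio can double as the voice prompt. The `CascadedTranslator` class, its three callables, and the stub functions are hypothetical names invented for this example. They are not the Mistral API, and a real system would replace the stubs with actual speech recognition, translation, and TTS calls.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadedTranslator:
    """Chains three stages: speech -> text -> translated text -> speech.

    All three callables are hypothetical placeholders supplied by the caller.
    """
    transcribe: Callable[[bytes], str]         # source audio -> source text
    translate: Callable[[str, str], str]       # (text, target language) -> text
    synthesize: Callable[[str, bytes], bytes]  # (text, voice prompt) -> audio

    def run(self, source_audio: bytes, target_lang: str) -> bytes:
        text = self.transcribe(source_audio)
        translated = self.translate(text, target_lang)
        # Reuse the original recording as the voice prompt so the output
        # keeps the speaker's voice (cross-lingual voice adaptation).
        return self.synthesize(translated, source_audio)

# Stub stages so the sketch runs end to end without any external service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")

def fake_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"

def fake_synthesize(text: str, voice_prompt: bytes) -> bytes:
    return text.encode("utf-8")

pipeline = CascadedTranslator(fake_transcribe, fake_translate, fake_synthesize)
result = pipeline.run(b"hello", "fr")
print(result)  # b'[fr] hello'
```

Passing the stages in as callables keeps the cascade independent of any particular provider, which is convenient when each stage is swapped or benchmarked separately.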
Built for low-latency streaming.
Latency is critical for voice agent applications. Voxtral TTS achieves a model latency of 70 ms for a typical input voice sample of 10 seconds and 500 characters, with a real-time factor (RTF) of ≈9.7x. The model natively generates up to two minutes of audio, and our API handles arbitrarily long generations with smart interleaving.
Voxtral TTS architecture.
The model is a transformer-based, autoregressive, flow-matching model, built on Ministral 3B. It consists of the following components:
3.4B-parameter transformer decoder backbone
390M flow-matching acoustic transformer
300M neural audio codec (symmetric encoder-decoder)
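As a quick sanity check, the component sizes listed above can be summed to see how they relate to the "4B parameters" headline figure. This is only arithmetic on the numbers quoted in the post. The dictionary keys are descriptive labels chosen for this example, not official module names.

```python
# Component sizes as quoted in the post (parameter counts).
COMPONENTS = {
    "transformer decoder backbone": 3.4e9,
    "flow-matching acoustic transformer": 390e6,
    "neural audio codec": 300e6,
}

total = sum(COMPONENTS.values())
print(f"total ≈ {total / 1e9:.2f}B parameters")  # total ≈ 4.09B parameters

# Share of the total contributed by each component.
for name, params in COMPONENTS.items():
    print(f"{name}: {params / total:.1%}")
```

The sum comes to roughly 4.09B, consistent with the rounded 4B figure, with the decoder backbone accounting for most of the total.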
The model takes a voice prompt (5 to 25 seconds) and a text prompt in one of the 9 supported languages. For each audio frame, the transformer backbone predicts a semantic token, then the flow-matching transformer runs 16 function evaluations (NFEs) to produce the acoustic latent. We developed an in-house codec that processes audio causally, using a semantic VQ latent (vocabulary of 8192) and an acoustic FSQ latent (36 dimensions, 21 levels), and produces them at a 12.5 Hz frame rate.
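The figures above imply some back-of-the-envelope numbers, sketched below: how many frames and flow-matching evaluations a clip requires, and roughly how many bits each frame's latent carries. Two readings are assumptions rather than statements from the post: that "36 dim and 21 levels" means 36 dimensions each quantized to 21 levels, and that an RTF of ≈9.7x means audio is generated about 9.7 times faster than real time. The function names are invented for this example.

```python
import math

# Figures quoted in the post.
FRAME_RATE_HZ = 12.5      # codec frames per second
NFES_PER_FRAME = 16       # flow-matching function evaluations per frame
SEMANTIC_VOCAB = 8192     # semantic VQ vocabulary size
ACOUSTIC_DIMS = 36        # acoustic FSQ dimensions
ACOUSTIC_LEVELS = 21      # assumption: levels per FSQ dimension
RTF_SPEEDUP = 9.7         # assumption: 9.7x faster than real time

def frames_for(seconds: float) -> int:
    """Number of codec frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * FRAME_RATE_HZ)

def total_nfes(seconds: float) -> int:
    """Flow-matching evaluations needed for `seconds` of audio."""
    return frames_for(seconds) * NFES_PER_FRAME

def bits_per_frame() -> float:
    """Upper bound on information per frame under the assumed FSQ reading."""
    semantic_bits = math.log2(SEMANTIC_VOCAB)                    # 13 bits
    acoustic_bits = ACOUSTIC_DIMS * math.log2(ACOUSTIC_LEVELS)   # about 158 bits
    return semantic_bits + acoustic_bits

def estimated_generation_seconds(seconds: float) -> float:
    """Wall-clock estimate under the assumed RTF interpretation."""
    return seconds / RTF_SPEEDUP

two_minutes = 120  # the post's native generation limit
print(frames_for(two_minutes))   # 1500
print(total_nfes(two_minutes))   # 24000
print(f"{bits_per_frame() * FRAME_RATE_HZ / 1000:.2f} kbit/s")
print(f"{estimated_generation_seconds(two_minutes):.1f} s")
```

Under these assumptions, a two-minute clip is 1500 frames and 24,000 flow-matching evaluations, and the latent stream works out to roughly 2 kbit/s, which gives a sense of how compact the codec representation is.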
Powering enterprise voice workflows.
Voxtral TTS closes the loop on audio...