Rime voice models now available on Together AI
Published 12/18/2025
High-performance enterprise TTS (text-to-speech) models with deterministic pronunciation and production-grade latency on dedicated and scalable infrastructure.
Authors
Arielle Fidel, Rajas Bansal, Sahil Yadav, Rishabh Bhargava, Sonny Khan
Summary
Two enterprise-grade Rime models on Together AI: Arcana v2 for expressivity, Mist v2 for pronunciation control
Deterministic pronunciation: Define a word once via API, it renders the same across calls, channels, and voices
Proven at scale: Over a billion conversations powered for multi-national telecom, financial services, healthcare companies, and more
Dedicated GPU endpoints on Together AI: Co-located with LLM and STT behind a single API and control plane
A voice agent can be correct and still feel broken. Customers judge it like a phone call: if it hesitates, sounds synthetic, or mispronounces a key term, trust collapses before they can evaluate reasoning.
In production, that experience comes down to a real-time loop: STT (speech-to-text) models transcribe speech, the LLM decides what to say, and TTS (text-to-speech) speaks the response. At scale, teams stitch that loop across multiple vendors, so latency, reliability, observability, and ultimately what the customer hears become difficult to manage end-to-end.
Starting today on Together AI, the AI Native Cloud, we're adding Rime Arcana v2 and Mist v2 to the Together Model Library, bringing proprietary TTS models into the same API, authentication, and observability surface you already use for LLM and speech workloads.
Arcana v2 delivers expressive, conversational voices trained on real customer service interactions, with 40+ voices across multiple languages and regional dialects for quality-critical scenarios.
Mist v2 brings deterministic pronunciation control to high-volume production environments, reaching about 225ms time-to-first-audio on Together AI dedicated endpoints: you define how a term sounds once via API and it renders consistently across all voices, flows, and channels.
Both run as dedicated endpoints on a single cloud alongside your LLM and STT workloads, so your end-to-end voice stack operates on one production platform instead of being split across multiple providers.
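To make the STT, LLM, and TTS loop concrete, here is a minimal latency-budget sketch. It is an illustration under stated assumptions, not Together AI's or Rime's tooling: only the TTS figure (about 225 ms time-to-first-audio for Mist v2) and the 700 ms conversational threshold come from this post, while the STT and LLM numbers, the `StageLatency` type, and the `remaining_budget` helper are hypothetical. It also assumes the stages run strictly one after another, which a streaming pipeline may not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Latency contribution of one stage in the voice loop (hypothetical type)."""
    name: str
    millis: float


def remaining_budget(stages: list[StageLatency], budget_ms: float) -> float:
    """Return the milliseconds left after all stages run back to back."""
    return budget_ms - sum(stage.millis for stage in stages)


# Only the TTS number is taken from the post; the other two are placeholders.
stages = [
    StageLatency("stt_final_transcript", 150.0),  # hypothetical
    StageLatency("llm_first_token", 250.0),       # hypothetical
    StageLatency("tts_first_audio", 225.0),       # about 225 ms, per the post
]

BUDGET_MS = 700.0  # end-to-end threshold the post cites for conversational feel

total = sum(stage.millis for stage in stages)
headroom = remaining_budget(stages, BUDGET_MS)
print(f"total: {total:.0f} ms, headroom: {headroom:.0f} ms")
# prints "total: 625 ms, headroom: 75 ms"
```

With these placeholder numbers the loop fits the budget with 75 ms to spare. Any extra network hop between vendors would come out of that headroom, which is the practical argument for co-locating the three stages.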
Rime Arcana v2 multilingual
English and Spanish code switching
"The model learns natural breathing, fillers, and backchannel cues, y cambia al español de forma natural siguiendo el ritmo de conversaciones reales en producción."
Arcana v2: Expressivity for enterprise conversations
Arcana v2 is deployed today from high-growth startups to Fortune 500s as part of their production infrastructure. Across these environments, customers report measurable gains, including a 15% lift in sales at a national restaurant chain, a 75% reduction in call abandonment at a telecom provider, and a 10% increase in call success rates.
Trained on the largest proprietary dataset of full-duplex conversational speech data
Arcana v2 is trained on real conversations with everyday people, not audiobooks, podcasts, or voiceover announcers. The model learns natural breathing, fillers, backchannel cues, and conversational pacing from production conversations. Callers recognize these patterns and stay in the automated flow longer, improving completion and containment rates.
40+ voices and regional dialects
Arcana v2 ships with more than 40 voices across English, Spanish, French, and German. English includes 18 voices spanning U.K., Australian, and Southern U.S. accents. Spanish includes four primary and three bilingual voices. Everyday words match local usage automatically. For example, "schedule" is pronounced "SHED-ule" in U.K. English and "SKED-ule" in U.S. English.
Rime Arcana v2
Real-time conversation
"Gosh that's a tough one. Hmmm. Let's see here."
Mist v2: Deterministic pronunciation at production scale
Mist v2 is designed for high-volume production environments where pronunciation accuracy must be guaranteed across millions of calls. It already powers tens of millions of production calls each month for customer service and IVR systems where downtime or quality regression has direct revenue and compliance impact.
Deterministic pronunciation control
Most TTS models guess pronunciation on each generation. Mist v2 is deterministic. You define how a word should sound once through the API, and that pronunciation holds across more than 40 voices, flows, and channels. No retraining and no per-vendor hacks. When your agent mispronounces a product name, drug, or acronym, you correct it once and the fix applies everywhere. Deterministic pronunciation configuration for Mist v2 is available today through our Sales team for production deployments; contact Sales to enable it for your environment.
English and Spanish with advanced pronunciation control
Mist v2 supports English and Spanish with deterministic pronunciation control. You specify how brand names, medication names, or technical terms should sound through the API, and Mist renders them consistently at conversational latency. If you need deterministic pronunciation at scale in Mist v2, contact Sales to enable it for your environment.
Proven at scale
Mist v2 serves tens of millions of calls monthly in production customer service and IVR environments. These are full-scale deployments where downtime or quality regression has direct revenue and compliance impact, not limited pilots.
Production-grade latency for conversational agents
Mist v2 reaches about 225ms p50 time-to-first-audio on Together AI dedicated endpoints. Voice agents need total end-to-end latency under 700ms to feel conversational,…
Excerpt shown — open the source for the full document.