Together AI, published Nov 4, 2025

Announcing the fastest inference for realtime voice AI agents



Published 11/4/2025


Authors

Rajas Bansal, Sahil Yadav, Garima Dhanania, Sri Yanamandra, Charles Zedlewski, Zain Hasan, Derek Petersen, Blaine Kasten, Sonny Khan, Rishabh Bhargava


Summary

Streaming Whisper speech-to-text (STT): Continuous transcription over WebSocket APIs optimized for voice agents

First serverless open-source text-to-speech (TTS): Orpheus (high-fidelity) and Kokoro (ultra-low latency) available through REST and WebSocket APIs without dedicated infrastructure

Voxtral transcription and speaker diarization: Premium multilingual transcription model and automatic speaker identification for batch processing

Voice interfaces are one of the hallmarks of a truly AI-native application. From transcription to speech-to-code to outbound calling to custom podcasts, voice makes applications engaging and productive. But developers often have to piece together several specialized voice services to ship a single voice application, which slows development and adds complexity, latency, and cost.

We're pleased to announce a greatly expanded set of high-performance, low-latency voice infrastructure on our cloud. We've worked hard to provide voice services that are frontier quality, developer friendly, and very low latency. With these additions, we've expanded our voice offering from transcription to a full set of building blocks that can power some or all of an application's voice pipeline. These services support real-time and batch patterns in developer-friendly serverless and dedicated form factors.

Streaming speech-to-text for voice agents

Streaming Whisper

Traditional batch transcription waits for complete audio files. Voice agents need to process speech as it arrives and intelligently detect when users finish speaking.

We've built the industry's fastest speech-to-text API by combining optimized model inference with intelligent system design: WebSocket streaming to eliminate connection overhead, carefully tuned voice activity detection (VAD), and purpose-built infrastructure for realtime audio processing. The result: Whisper running in real time with minimal quality degradation, completing transcripts up to 35% faster than alternatives.

The key is optimizing for time-to-complete-transcript, not just time-to-first-token. Voice agents need to know precisely when a user stops speaking to begin formulating responses. Our VAD tuning ensures your agent responds at the right moment, not too early (cutting users off) or too late (creating dead air).
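To make the "too early versus too late" trade-off concrete, here is a minimal sketch of an energy-based end-of-speech detector. It is not Together AI's VAD, which the post does not describe. The class name `EndpointDetector`, the RMS threshold, and the hangover length are all hypothetical values chosen for illustration. A short hangover ends the turn sooner but risks cutting users off mid-pause; a long one adds dead air.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EndpointDetector:
    """Toy end-of-utterance detector (illustrative only, not a production VAD)."""

    energy_threshold: float = 0.01  # hypothetical RMS level separating speech from silence
    hangover_frames: int = 15       # hypothetical: 15 frames of 20 ms = 300 ms of silence
    _speech_seen: bool = False
    _silent_run: int = 0

    def push(self, frame: Sequence[float]) -> bool:
        """Feed one audio frame; return True when the utterance is judged finished."""
        if frame:
            rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        else:
            rms = 0.0

        if rms >= self.energy_threshold:
            # Speech resets the silence counter, so brief pauses do not end the turn.
            self._speech_seen = True
            self._silent_run = 0
            return False

        if self._speech_seen:
            self._silent_run += 1
            if self._silent_run >= self.hangover_frames:
                # Enough trailing silence after speech: declare end of utterance.
                self._speech_seen = False
                self._silent_run = 0
                return True
        return False


if __name__ == "__main__":
    speech, silence = [0.2] * 160, [0.0] * 160
    detector = EndpointDetector(hangover_frames=3)
    frames = [silence] * 2 + [speech] * 3 + [silence] * 2 + [speech] + [silence] * 3
    print([detector.push(f) for f in frames])
```

In this sketch the two-frame pause in the middle does not end the turn because it is shorter than the three-frame hangover; only the final run of silence does.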

STREAMING WHISPER


Real-time transcription with industry-leading latency. Carefully tuned voice activity detection for natural conversation flow.

$0.0035/min


Text-to-speech: Serverless open-source models

Together AI is the first cloud to provide serverless open-source text-to-speech models. No more spinning up dedicated instances for sporadic TTS needs: both models are available through REST APIs for batch generation and WebSocket APIs for realtime streaming.

Orpheus TTS: Natural voice quality

Orpheus delivers natural, expressive speech with multiple voice options suitable for customer-facing applications. At 187ms time-to-first-byte, it outpaces premium providers while approaching the speed of lighter models. The result: professional voice quality without sacrificing the responsiveness voice agents require.
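Time-to-first-byte (TTFB) is the delay between starting a request and receiving the first non-empty chunk of audio. The sketch below shows one way to measure it over any iterable of byte chunks. It makes no network calls: `measure_ttfb` and `fake_tts_stream` are hypothetical helpers, and the fake stream merely stands in for a streaming TTS response, so the numbers it produces say nothing about the services described here.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttfb(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = clock()
    ttfb = None
    total = 0
    for chunk in stream:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if ttfb is None:
            ttfb = clock() - start
        total += len(chunk)
    if ttfb is None:
        raise ValueError("stream produced no audio bytes")
    return ttfb, total


def fake_tts_stream(first_chunk_delay_s: float = 0.02) -> Iterator[bytes]:
    """Stand-in for a streaming TTS response (assumption: audio arrives in chunks)."""
    time.sleep(first_chunk_delay_s)  # simulated model and network latency
    yield b"\x00" * 320
    yield b"\x00" * 320


if __name__ == "__main__":
    ttfb_s, n_bytes = measure_ttfb(fake_tts_stream())
    print(f"TTFB: {ttfb_s * 1000:.1f} ms, received {n_bytes} bytes")
```

Against a real endpoint, the clock should start immediately before the request is sent so that connection setup is included in the measurement.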

ORPHEUS TTS


High-fidelity voice generation with natural prosody. 187ms average time-to-first-byte—faster than premium proprietary providers.

$15/1M chars


Kokoro TTS

When every millisecond counts, Kokoro delivers. With 97ms baseline TTFB, it's built for applications where response speed trumps all else. This predictable performance makes it ideal for high-volume voice agent deployments where cost and latency are critical.

KOKORO TTS


Ultra-fast production-scale voice. 97ms time-to-first-byte, more than 2x faster than alternatives, with consistent performance under load.

$4/1M chars


New audio transcriptions

Two new capabilities expand our audio transcriptions API for batch processing workflows:

Voxtral Mini

Voxtral Mini is a higher-accuracy transcription model from Mistral AI, optimized for European languages and challenging audio conditions. Voxtral delivers measurably lower word error rates than standard Whisper, ideal for applications where transcription mistakes create liability or operational overhead.

VOXTRAL


Premium multilingual transcription optimized for European languages and challenging audio conditions with measurably lower word error rates.

$0.0030/min

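For rough budgeting, the list prices quoted in this post can be turned into a small estimator. This is a sketch under stated assumptions: the dictionary keys are hypothetical labels rather than API model identifiers, and it assumes simple linear pricing, since the post does not describe rounding, minimums, or volume discounts.

```python
from decimal import Decimal

# Prices as quoted in this post; keys are illustrative labels, not API model IDs.
PRICE_PER_MINUTE = {
    "streaming-whisper": Decimal("0.0035"),
    "voxtral": Decimal("0.0030"),
}
PRICE_PER_MILLION_CHARS = {
    "orpheus": Decimal("15"),
    "kokoro": Decimal("4"),
}


def transcription_cost(model: str, minutes: float) -> Decimal:
    """Estimated USD cost to transcribe `minutes` of audio (assumes linear pricing)."""
    return PRICE_PER_MINUTE[model] * Decimal(str(minutes))


def tts_cost(model: str, characters: int) -> Decimal:
    """Estimated USD cost to synthesize `characters` of text (assumes linear pricing)."""
    return PRICE_PER_MILLION_CHARS[model] * Decimal(characters) / Decimal(1_000_000)


if __name__ == "__main__":
    print("1,000 min on Streaming Whisper:", transcription_cost("streaming-whisper", 1000))
    print("250,000 chars on Kokoro:", tts_cost("kokoro", 250_000))
    print("250,000 chars on Orpheus:", tts_cost("orpheus", 250_000))
```

`Decimal` is used instead of floats so that small per-minute prices do not accumulate binary rounding error across large volumes.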

Speaker Diarization

Automatically identify and label different speakers in recorded audio. Transform raw transcripts into structured conversations showing who said what and when, essential for meeting transcription, call center quality assurance, and multi-party conversation review.

Built for production voice agents

Three architectural decisions make Together AI's audio infrastructure uniquely suited for production voice agents:

Latency: Response times that enable natural conversation

Human conversation flows at a specific pace. Responses that take longer than 500ms feel unnatural. Beyond 2 seconds, users assume the system has failed. Every additional 100ms of latency measurably decreases user satisfaction and task completion rates.

Our infrastructure eliminates unnecessary latency at every layer. WebSocket connections stay alive, avoiding TCP handshake overhead. Models run on the same GPU…
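The 500ms and 2-second thresholds above suggest a simple per-turn latency budget. The sketch below adds up the stages of one voice-agent turn and classifies the total against those thresholds. The TTS figures (97ms and 187ms) are the TTFB numbers quoted in this post; the speech-to-text and LLM figures are hypothetical placeholders, and the function names are illustrative.

```python
NATURAL_LIMIT_MS = 500   # from the post: slower than this feels unnatural
FAILURE_LIMIT_MS = 2000  # from the post: beyond this, users assume failure


def turn_latency_ms(stt_finalize_ms: float, llm_first_token_ms: float, tts_ttfb_ms: float) -> float:
    """Time from end of user speech to first audible response, assuming serial stages."""
    return stt_finalize_ms + llm_first_token_ms + tts_ttfb_ms


def classify(total_ms: float) -> str:
    """Map a total response latency onto the thresholds described in the post."""
    if total_ms <= NATURAL_LIMIT_MS:
        return "natural"
    if total_ms <= FAILURE_LIMIT_MS:
        return "unnatural"
    return "assumed failed"


if __name__ == "__main__":
    # Hypothetical 150 ms STT finalization and 200 ms LLM time-to-first-token.
    for name, tts_ttfb in (("Kokoro", 97), ("Orpheus", 187)):
        total = turn_latency_ms(150, 200, tts_ttfb)
        print(f"{name}: {total:.0f} ms -> {classify(total)}")
```

With these placeholder inputs the budget is tight enough that a 90ms difference in TTS time-to-first-byte moves the turn across the 500ms line, which is why each stage's latency matters.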
