WritingOpenAIOpenAIpublished Oct 1, 2024seen 6d

Introducing the Realtime API

Open original ↗

Captured source

source ↗
published Oct 1, 2024seen 6dcaptured 2dhttp 200method exa

Introducing the Realtime API | OpenAI

October 1, 2024

Introducing the Realtime API

Developers can now build fast speech-to-speech experiences into their applications

Share

Update on August 28, 2025: We announced the general availability of the Realtime API. Learn more here.

---

Update on February 3, 2025: We no longer limit the number of simultaneous sessions on the Realtime API. Please refer to our docs⁠ for the latest rate limits on the Realtime API.

---

Update on October 30, 2024: We've added five new voices with greater range and expressiveness. Cached pricing is now also available for text and audio inputs, lowering the price to $2.50/1M cached text input tokens and $20/1M cached audio input tokens. Learn more here⁠.

---

Update on October 17, 2024: Audio inputs and outputs are now available in the Chat Completions API. Get started here⁠.

---

Today, we're introducing a public beta of the Realtime API, enabling all paid developers to build low-latency, multimodal experiences in their apps. Similar to ChatGPT’s Advanced Voice Mode, the Realtime API supports natural speech-to-speech conversations using the six preset voices⁠ already supported in the API.

We’re also introducing audio input and output in the Chat Completions API⁠ to support use cases that don’t require the low-latency benefits of the Realtime API. With this update, developers can pass any text or audio inputs into GPT‑4o⁠ and have the model respond with their choice of text, audio, or both.

From language apps and educational software to customer support experiences, developers have already been leveraging voice experiences to connect with their users. Now with the Realtime API and soon with audio in the Chat Completions API, developers no longer have to stitch together multiple models to power these experiences. Instead, you can build natural conversational experiences with a single API call.

How it works

Previously, to create a similar voice assistant experience, developers had to transcribe audio with an automatic speech recognition model like Whisper⁠, pass the text to a text model for inference or reasoning, and then play the model’s output using a text-to-speech⁠ model. This approach often resulted in loss of emotion, emphasis and accents, plus noticeable latency. With the Chat Completions API, developers can handle the entire process with a single API call, though it remains slower than human conversation. The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.

Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT‑4o. The API supports function calling⁠, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context. For example, a voice assistant could place an order on behalf of the user or retrieve relevant customer information to personalize its responses.

Powering customer support agents, language learning assistants, and more

As part of our iterative deployment strategy, we’ve been testing the Realtime API with a handful of partners to gather feedback while we build. A couple of promising early use cases include:

Healthify, a nutrition and fitness coaching app, uses the Realtime API to enable natural conversations with its AI coach Ria, while involving human dietitians when needed for personalized support.

Speak, a language learning app, uses Realtime API to power its role-play feature, encouraging users to practice conversations in a new language.

Availability & pricing

The Realtime API will begin rolling out today in public beta to all paid developers. Audio capabilities in the Realtime API are powered by the new GPT‑4o modelgpt-4o-realtime-preview.

Audio in the Chat Completions API will be released in the coming weeks, as a new modelgpt-4o-audio-preview. Withgpt-4o-audio-preview, developers can input text or audio into GPT‑4o and receive responses in text, audio, or both.

The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M and $20 per 1M output tokens. Audio input is priced at $100 per 1M tokens and output is $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be the same price.

Safety & privacy

The Realtime API uses multiple layers of safety protections to mitigate the risk of API abuse, including automated monitoring and human review of flagged model inputs and outputs. The Realtime API is built on the same version of GPT‑4o that powers Advanced Voice Mode in ChatGPT, which we carefully assessed using both automated and human evaluations, including evaluations according to our Preparedness Framework⁠, detailed in the GPT‑4o System Card⁠. The Realtime API also leverages the same audio safety infrastructure we built for Advanced Voice Mode, which our testing shows has helped to reduce the potential for harm.

It is against our usage policies⁠ to repurpose or distribute output from our services to spam, mislead, or otherwise harm others – and we actively monitor for potential abuse. Our policies also require developers to make it clear to their users that they are interacting with AI, unless it's obvious from the context.

Prior to launch, we tested the Realtime API with our external red teaming network and found that the Realtime API didn’t introduce any high-risk gaps not covered by our existing mitigations. As with all API services, the Realtime API is subject to our Enterprise privacy⁠ commitments. We do not train our models on the inputs or outputs used in this service without your explicit permission.

Getting started

Developers can start building with the Realtime API over the coming days in the Playground⁠, or by using our docs⁠ and the reference client⁠.

We’ve also worked with LiveKit⁠ and Agora⁠ to create client libraries of audio components like echo cancellation, reconnection, and sound isolation, and Twilio⁠ to integrate the Realtime API with Twilio’s Voice APIs⁠ which enable developers to seamlessly build, deploy and connect AI virtual agents to customers via voice calls.

What’s next

As we work towards general availability, we’re actively collecting feedback to improve the Realtime API. Some of the…

Excerpt shown — open the source for the full document.

Notability

notability 8.0/10

Major API release from OpenAI, high relevance.