Improved Gemini audio models for powerful voice experiences
Gemini 2.5 Native Audio upgrade, plus text-to-speech model updates
Improved Gemini audio models for powerful voice interactions
Bibo Xu
Director of Product Management
Tara Sainath
Distinguished Research Scientist
Read AI-generated summary
General summary
Google enhanced Gemini 2.5 Flash Native Audio for better live voice agents. Expect sharper function calling, robust instruction following and smoother conversations. Try live speech translation in the Google Translate app beta, rolling out now on Android in the US, Mexico and India.
Summaries were generated by Google AI. Generative AI is experimental.
Bullet points
"Improved Gemini audio models for powerful voice interactions" enhance live agents and translation.
Gemini 2.5 Flash Native Audio now has sharper function calling and better instruction following.
The update allows for smoother conversations by retrieving context from previous turns.
Live speech translation in Google Translate preserves intonation and handles 70+ languages.
You can start building voice agents today with Gemini 2.5 Flash Native Audio on Vertex AI.
Basic explainer
Google made its Gemini AI better at understanding and speaking in conversations. It can now understand instructions better, have smoother conversations, and translate languages in real time. This means AI can help businesses with customer service and people can understand each other better, even if they speak different languages. You can even try out the live translation feature in the Google Translate app.
Earlier this week, we introduced greater control over audio generation with an upgrade to our Gemini 2.5 Pro and Flash Text-to-Speech models. But generating expressive speech is only one side of the conversation.
Today, we’re releasing an updated Gemini 2.5 Flash Native Audio for live voice agents. This update improves the model’s ability to handle complex workflows, navigate user instructions, and hold natural conversations.
Gemini 2.5 Flash Native Audio is now available across Google products, including Google AI Studio and Vertex AI, and has also started rolling out in Gemini Live and Search Live, bringing the naturalness of native audio to Search Live for the first time. This means you can more effectively brainstorm live with Gemini, get real-time help in Search Live, or build the next generation of enterprise-ready customer service agents.
Beyond powering helpful agents, native audio unlocks new possibilities for global communication. We’re introducing live speech translation, a capability that enables streaming speech-to-speech translation for headphones. It preserves the speaker’s intonation, pacing and pitch. This beta experience is rolling out in the Google Translate app starting today.
Live Voice Agents
To enable the breadth of use cases across surfaces and products, we have improved Gemini 2.5 Native Audio in three key areas:
Sharper function calling: We’ve improved the model's reliability when triggering external functions. It can now more accurately identify when to fetch real-time information during a conversation and seamlessly weave that data back into the audio response, without breaking the flow. On ComplexFuncBench Audio, an eval that captures multi-step function calling with various constraints, Gemini 2.5 Native Audio leads with a score of 71.5%.
Robust instruction following: The model is now better at handling complex instructions, resulting in higher user satisfaction on content completeness. With a 90% adherence rate to developer instructions (up from 84%), it delivers more reliable outputs.
Smoother conversations: We’ve achieved significant gains in multi-turn conversation quality. Gemini 2.5 Flash Native Audio is able to retrieve context from previous turns more effectively, creating more cohesive conversations.
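For readers unfamiliar with function calling, the round trip described above can be illustrated with a minimal, generic sketch. It is not the Gemini Live API: the `FunctionCall` shape, the `get_order_status` tool, and the `TOOLS` registry are all hypothetical names chosen for illustration. The idea is that the model emits a structured request naming a function and its arguments, the application runs that function, and the result is handed back so the model can weave it into its spoken reply.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class FunctionCall:
    """A tool request emitted by the model mid-conversation (hypothetical shape)."""
    name: str
    args: Dict[str, Any]


def get_order_status(order_id: str) -> Dict[str, Any]:
    """Hypothetical tool; a real agent would query a live order system here."""
    fake_db = {"A123": "shipped", "B456": "processing"}
    return {"order_id": order_id, "status": fake_db.get(order_id, "unknown")}


# Registry mapping function names the model may request to local callables.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_order_status": get_order_status,
}


def handle_function_call(call: FunctionCall) -> Dict[str, Any]:
    """Run the requested tool and wrap the result to send back to the model."""
    tool = TOOLS.get(call.name)
    if tool is None:
        return {"name": call.name, "error": f"unknown function: {call.name}"}
    try:
        return {"name": call.name, "response": tool(**call.args)}
    except TypeError as exc:
        # Raised when the model supplies wrong or missing arguments.
        return {"name": call.name, "error": str(exc)}


if __name__ == "__main__":
    call = FunctionCall(name="get_order_status", args={"order_id": "A123"})
    print(handle_function_call(call))
```

Returning errors as data rather than raising lets the agent explain the problem to the caller or retry, instead of dropping the conversation.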
The updated Gemini 2.5 Flash Native Audio’s performance against previous versions and industry competitors on ComplexFuncBench
What customers are saying
Google Cloud customers are already using Gemini’s native audio capabilities to drive real business results, from mortgage processing to customer calls.
“Users often forget they’re talking to AI within a minute of using Sidekick, and in some cases have thanked the bot after a long chat… New Live API AI capabilities offered through Gemini [2.5 Flash Native Audio] empower our merchants to win.” – David Wurtz, VP of Product, Shopify
“By integrating the Gemini 2.5 Flash Native Audio model… we've significantly enhanced Mia's capabilities since launching in May 2025. This powerful combination has enabled us to generate over 14,000 loans for our broker partners.” – Jason Bressler, Chief Technology Officer, United Wholesale Mortgage (UWM)
“Working with the Gemini 2.5 Flash Native Audio model through Vertex AI allows Newo.ai AI Receptionists to achieve unmatched conversational intelligence ... They can identify the main speaker even in noisy settings, switch languages mid-conversation, and sound remarkably natural and emotionally expressive.” – David Yang, Co-founder, Newo.ai
Live Speech Translation
Gemini now natively supports new live speech-to-speech translation capabilities designed to handle both continuous listening and two-way conversation.
With continuous listening, Gemini automatically translates speech in multiple languages into a single target language. This allows you to put headphones in and hear the world around you in your language.
For two-way conversation, Gemini’s live speech translation handles translation between two languages in real-time, automatically switching the output language based on who is speaking. For example, if you speak English and want to chat with a Hindi speaker, you’ll hear English translations in real-time in your headphones, while your phone broadcasts Hindi when you’re done speaking.
Gemini’s live speech translation has a number of key capabilities that help in the real world:
Language coverage: Translates speech in over 70 languages and…