Improving quality and robustness in LLM-based text-to-speech systems
Source: Amazon Science
Low-rank adaptation, data augmentation, and chain-of-thought reasoning are among the techniques enabling accent-free polyglot outputs, improved expressiveness, and reliable synthesis.
By Ammar Abbas
April 1, 2026
Overview by Amazon Nova
Accent-free polyglot voice cloning is achieved through locale-specific data augmentation and low-rank adaptation (LoRA) fine-tuning, enabling cloned voices to speak target languages with native-like pronunciation without loss of speaker identity. Expressiveness is enhanced through classifier-free guidance (CFG), which generates synthetic reference audio samples with improved prosodic styles, delivering 5%–20% quality improvements across nine locales spanning English, French, Italian, German, and Spanish. Reliability is improved through chain-of-thought reasoning, guardrails, agentic regeneration, and smart data filtering, reducing critical errors to less than one second per hour on long-form text by predicting phoneme sequences and duration before generation.
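For readers unfamiliar with classifier-free guidance, the commonly used formulation is shown below. This excerpt does not give the exact form used in this work, so the symbols here are assumptions: $\ell_t^{\text{cond}}$ and $\ell_t^{\text{uncond}}$ are the model's outputs at step $t$ with and without the conditioning signal, and $w$ is the guidance scale.

```latex
\tilde{\ell}_t \;=\; \ell_t^{\text{uncond}} \;+\; w \,\bigl(\ell_t^{\text{cond}} - \ell_t^{\text{uncond}}\bigr)
```

With $w = 1$ this reduces to the conditional output, and $w > 1$ pushes the output further toward the conditioning signal, which is the usual way guidance is used to strengthen a style.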
Text-to-speech models based on large language models (LLMs) have gotten very good at producing natural-sounding speech, even in voices cloned from short audio files. But some problems persist. One is accent leakage in polyglot text-to-speech. It should be possible to transfer a voice recorded in English to French, German, or Spanish with the correct accent and without loss of voice identity. But with most systems, the reference speaker's native accent leaks into the target language, or the target language's accent overwrites characteristics of the speaker's voice.
Figure: It should be possible to transfer a voice recorded in English to another language (say, French) with the correct accent and without loss of voice identity (left). But with many systems, the reference speaker's native accent leaks into the target language (right).
Expressiveness, including the laughs, sighs, hesitations, and other indications of emotion that make speech engaging, is another challenge. And then there's reliability. Unlike traditional text-to-speech (TTS) systems, LLM-based systems are autoregressive, meaning they generate speech tokens one at a time, without explicitly modeling duration. This can cause hallucinated repetitions, unexpected cutoffs, and inconsistent pronunciation. At Amazon, we're working to address all these issues.
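To make the reliability failure modes concrete, here is a minimal sketch of a post-generation check for two of them: looping repetitions and unexpected cutoffs. It is not the article's implementation. The function names, the thresholds, and the idea of comparing token count against an externally predicted `expected_frames` are all illustrative assumptions, and the retry loop is only a plain stand-in for regeneration.

```python
def longest_ngram_run(tokens, n):
    """Largest number of back-to-back repeats of any n-gram in tokens."""
    best = 0
    for start in range(len(tokens) - n + 1):
        gram = tokens[start:start + n]
        run, pos = 1, start + n
        # Count how many times the same n-gram follows itself immediately.
        while tokens[pos:pos + n] == gram:
            run += 1
            pos += n
        best = max(best, run)
    return best


def check_generation(tokens, expected_frames, n=4, max_run=3, tolerance=0.5):
    """Return a list of suspected problems (hypothetical thresholds)."""
    issues = []
    if longest_ngram_run(tokens, n) > max_run:
        issues.append("repetition")  # likely a hallucinated loop
    if len(tokens) < expected_frames * (1 - tolerance):
        issues.append("cutoff")      # far shorter than predicted
    if len(tokens) > expected_frames * (1 + tolerance):
        issues.append("overlong")    # far longer than predicted
    return issues


def generate_with_retries(generate, expected_frames, max_attempts=3):
    """Call generate(attempt) until the output passes the checks."""
    for attempt in range(1, max_attempts + 1):
        tokens = generate(attempt)
        if not check_generation(tokens, expected_frames):
            return tokens, attempt
    return None, max_attempts


if __name__ == "__main__":
    looping = [1, 2, 3, 4] * 5
    print(check_generation(looping, expected_frames=20))
```

The duration check only works if some estimate of the expected length exists before generation, which is why an autoregressive model that does not model duration explicitly is harder to guard.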
Mitigating accent leakage in polyglot TTS
We use a locale-specific data augmentation approach to address the problem of accent leakage. Specifically, we use low-rank adaptation (LoRA) to fine-tune our polyglot models on data that is heavily weighted toward target locales. This also allows us to do accent-free polyglot voice cloning: the cloned voice speaks the target language with native-like pronunciation but…
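As a rough illustration of the two ingredients named above, the sketch below shows a LoRA-style layer and a locale-weighted sampler in plain Python. It is a toy under stated assumptions, not the authors' training code: the class `LoRALinear`, the helper `sample_locales`, the locale codes, and the weights are all hypothetical, and a real system would use a tensor library.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weights, never updated
        # A starts small and random, B starts at zero, so the adapted
        # layer initially matches the base layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def sample_locales(weights, k, seed=0):
    """Draw k locale labels, weighted toward the target locale."""
    rng = random.Random(seed)
    locales = list(weights)
    return rng.choices(locales, weights=[weights[l] for l in locales], k=k)


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
    print(layer.forward([1.0, 1.0]))
    print(sample_locales({"fr-FR": 0.8, "en-US": 0.1, "de-DE": 0.1}, k=5))
```

The point of the low-rank factorization is that only `A` and `B` are trained: for an 8-by-8 weight matrix and rank 2, that is 32 trainable values instead of 64, and the gap grows quickly with layer size.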
Excerpt shown — open the source for the full document.