Amazon (Nova) Writing: Improving quality and robustness in LLM-based text-to-speech systems

Captured source

amazon.science/amazon.science/blog/improving-quality-and-robustness-in-llm-based-text-to-speech-systems

Improving quality and robustness in LLM-based text-to-speech systems

published Apr 1, 2026 · seen Jun 5 · captured Jun 7 · http 200 · method plain

Conversational AI

Improving quality and robustness in LLM-based text-to-speech systems

Low-rank adaptation, data augmentation, and chain-of-thought reasoning are among the techniques enabling accent-free polyglot outputs, improved expressiveness, and reliable synthesis.

By Ammar Abbas

April 1, 2026

5 min read

Overview by Amazon Nova

Accent-free polyglot voice cloning is achieved through locale-specific data augmentation and low-rank adaptation (LoRA) fine-tuning, enabling cloned voices to speak target languages with native-like pronunciation without loss of speaker identity. Expressiveness is enhanced through classifier-free guidance (CFG), which generates synthetic reference audio samples with improved prosodic styles, delivering 5%–20% quality improvements across nine locales spanning English, French, Italian, German, and Spanish. Reliability is improved through chain-of-thought reasoning, guardrails, agentic regeneration, and smart data filtering, reducing critical errors to less than one second per hour on long-form text by predicting phoneme sequences and duration before generation.
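The overview names classifier-free guidance without giving a formula. As a minimal sketch, assuming the standard formulation applied to next-token logits (the source does not say where or how guidance is applied in this system), the guided prediction extrapolates from the unconditional output toward the conditional one. The function name `cfg_logits`, the toy logits, and the scale value are hypothetical.

```python
from typing import List


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 1.0 returns the conditional logits unchanged; scale > 1.0
    pushes the prediction further toward the conditioning signal.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy logits over a 3-token vocabulary (illustrative numbers only).
conditional = [2.0, 0.5, -1.0]
unconditional = [1.0, 0.5, 0.0]
guided = cfg_logits(conditional, unconditional, scale=2.0)
```

With a scale above 1, tokens favored by the conditioning signal gain probability and tokens it disfavors lose it, which is the usual lever for strengthening a style such as prosody.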

Text-to-speech models based on large language models (LLMs) have gotten very good at producing natural-sounding speech, even in voices cloned from short audio files. But some problems with these models persist. One is accent leakage in polyglot text-to-speech. It should be possible to transfer a voice recorded in English to French, German, or Spanish with the correct accent and without loss of voice identity. But with most systems, the reference speaker's native accent leaks into the target language, or the target language's accent overwrites characteristics of the speaker’s voice.

It should be possible to transfer a voice recorded in English to another language, say French, with the correct accent and without loss of voice identity (left). But with many systems, the reference speaker's native accent leaks into the target language (right).

Expressiveness is another challenge: reproducing the laughs, sighs, hesitations, and other indications of emotion that make speech engaging. And then there’s reliability. Unlike traditional text-to-speech (TTS) systems, LLM-based systems are autoregressive, meaning they generate speech tokens one at a time, without explicitly modeling duration. This can cause hallucinated repetitions, unexpected cutoffs, and inconsistent pronunciation. At Amazon, we're working to address all these issues.
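To make these failure modes concrete, here is a toy sketch of an autoregressive decoding loop followed by simple post-hoc checks. Everything in it is an assumption for illustration: the `next_token` callback, the `EOS` id, the n-gram repetition test, the expected length, and the thresholds are hypothetical and do not describe Amazon's guardrails.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-speech token id


def generate(next_token: Callable[[List[int]], int], max_tokens: int) -> List[int]:
    """Emit tokens one at a time; length is decided only by when EOS appears."""
    tokens: List[int] = []
    while len(tokens) < max_tokens:
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == EOS:
            break
    return tokens


def has_repetition_loop(tokens: List[int], ngram: int = 3, repeats: int = 3) -> bool:
    """True if some n-gram occurs `repeats` times back to back."""
    span = ngram * repeats
    for start in range(len(tokens) - span + 1):
        window = tokens[start:start + span]
        first = window[:ngram]
        if all(window[i:i + ngram] == first for i in range(0, span, ngram)):
            return True
    return False


def check(tokens: List[int], expected_len: int, tolerance: float = 0.5) -> List[str]:
    """Flag a missing stop token, a repetition loop, or an implausible length."""
    problems = []
    if not tokens or tokens[-1] != EOS:
        problems.append("no_stop_token")
    if has_repetition_loop(tokens):
        problems.append("repetition")
    if abs(len(tokens) - expected_len) > tolerance * expected_len:
        problems.append("length_mismatch")
    return problems


# A stub "model" that gets stuck repeating 5, 6, 7 and never emits EOS.
stuck = generate(lambda prev: [5, 6, 7][len(prev) % 3], max_tokens=12)
problems = check(stuck, expected_len=6)
```

Because the loop has no notion of how long the utterance should be, the length check needs an outside estimate; in this sketch `expected_len` is simply passed in by the caller.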

Mitigating accent leakage in polyglot TTS

We use a locale-specific data augmentation approach to address the problem of accent leakage. Specifically, we use low-rank adaptation (LoRA) to fine-tune our polyglot models on data that is heavily weighted toward target locales. This also allows us to do accent-free polyglot voice cloning: the cloned voice speaks the target language with native-like pronunciation but...
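For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update in isolation: the pretrained weight W stays frozen and only two small matrices A and B are trained, giving an effective weight W + (alpha / r) * B A. The shapes, values, and helper names are illustrative assumptions; the excerpt does not describe the model's layers, rank, or scaling.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Effective weight W + (alpha / r) * B @ A, where r is the LoRA rank."""
    r = len(a)  # A has shape (r, in_features), B has shape (out_features, r)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Frozen 2x3 pretrained weight and a rank-1 adapter (toy numbers).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0],
     [2.0]]            # out_features x r
A = [[0.5, 0.0, 1.0]]  # r x in_features
W_adapted = lora_weight(W, B, A, alpha=2.0)

# Trainable parameters: r * (in + out) for the adapter vs. in * out for full fine-tuning.
adapter_params = len(A) * (len(A[0]) + len(B))
full_params = len(W) * len(W[0])
```

At this toy size the adapter saves almost nothing (5 parameters against 6), but the adapter count grows linearly with layer width while the full count grows quadratically, which is why a small rank is attractive for per-locale fine-tuning.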

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Substantive research from major lab.

Amazon (Nova) has a writing signal matching data demand, evals and quality, infrastructure.
