Making LLMs faster without sacrificing accuracy

New scaling law connects LLM architecture to inference efficiency, boosting throughput up to 47% - Amazon Science

A new scaling law that relates particular architectural choices to loss helps identify models that improve throughput by up to 47% with no loss of accuracy.

By Tao Yu , Youngsuk Park

May 15, 2026

Overview by Amazon Nova

Surefire models match or exceed LLaMA-3.2 accuracy while improving throughput by up to 47%, with gains consistent across A100 and H200 GPUs and multiple serving frameworks. The optimal MLP-to-attention ratio of LLaMA-3.2-style models is around 1.0, far lower than that of existing open-weight versions (e.g., 4.8 for LLaMA-3.2-1B).

Large language models (LLMs) keep getting bigger and better. But the cost of running them (generating text, answering questions, powering real-time applications) is scaling up, too. Model accuracy is important, but for real-time AI-based web applications, it can't come at the expense of efficiency. In a paper we presented at the International Conference on Learning Representations (ICLR), we provide a framework for navigating this accuracy-versus-efficiency tradeoff by connecting scaling laws directly to architectural-design decisions.

The gap in current scaling laws

In 2022, Google DeepMind announced the results of a study involving an experimental LLM called Chinchilla. The DeepMind researchers demonstrated a scaling law that enabled joint optimization of model size and training data to achieve a desired loss level, given a particular computational budget. More precisely, the law relates the model loss (L) to the number of model parameters (N) and the number of tokens in the training dataset (D):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

The Chinchilla scaling law relates model loss (L) to parameter count (N) and training-token count but says nothing about the model's internal architecture — the gap this work addresses.
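
To make the shape of this law concrete, here is a minimal Python sketch that evaluates it in its commonly published form, L(N, D) = E + A / N^α + B / D^β, where D is our name for the training-token count. The default coefficient values are an assumption: they are the fitted values reported in the original Chinchilla paper, not numbers taken from this article, and the function name `chinchilla_loss` is purely illustrative.

```python
def chinchilla_loss(
    n_params: float,
    n_tokens: float,
    E: float = 1.69,      # irreducible loss (assumed: Chinchilla paper fit)
    A: float = 406.4,     # parameter-count coefficient (assumed)
    B: float = 410.7,     # token-count coefficient (assumed)
    alpha: float = 0.34,  # parameter-count exponent (assumed)
    beta: float = 0.28,   # token-count exponent (assumed)
) -> float:
    """Predicted loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**alpha + B / n_tokens**beta


if __name__ == "__main__":
    # Hypothetical (N, D) pairs, chosen only to show the trend.
    for n, d in [(1e9, 20e9), (1e9, 100e9), (3e9, 100e9)]:
        print(f"N={n:.0e}  D={d:.0e}  predicted loss={chinchilla_loss(n, d):.3f}")
```

Note that the function takes only N and D as inputs. Two models with very different layer shapes but the same parameter and token counts get the same predicted loss, which is exactly the blind spot described here.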

The other variables in this equation (E, A, B, α, and β) are all learnable coefficients, which the DeepMind researchers fitted through extensive experimentation. This "Chinchilla law" doesn't specify architectural choices, such as the size of the model's internal representations (the "hidden size") or the relative number of parameters allocated to attention layers and multilayer perceptron (MLP) layers. However, two models with the same parameter count (in the billions), trained on the same data and reaching the same accuracy, can differ by up to 40% in inference-time throughput, depending on those additional architectural choices. We set out to deduce scaling laws that can help predict those choices.
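
As a rough illustration of what "relative number of parameters allocated to attention layers and MLP layers" means in practice, the sketch below counts the weight-matrix parameters in one Transformer block and takes their ratio. Several things here are assumptions rather than details from this article: that the MLP-to-attention ratio is a per-block parameter ratio, that the block uses grouped-query attention and a three-matrix gated MLP, and that the configuration values match the publicly released LLaMA-3.2-1B configuration. Biases, normalization layers, and embeddings are ignored, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockConfig:
    hidden_size: int        # width of the model's internal representations
    intermediate_size: int  # inner width of the MLP
    num_heads: int          # query heads
    num_kv_heads: int       # key/value heads (grouped-query attention)
    head_dim: int           # dimension per head


def attention_params(c: BlockConfig) -> int:
    """Weights in the query, key, value, and output projections."""
    q_and_o = 2 * c.hidden_size * c.num_heads * c.head_dim
    k_and_v = 2 * c.hidden_size * c.num_kv_heads * c.head_dim
    return q_and_o + k_and_v


def mlp_params(c: BlockConfig) -> int:
    """Weights in a gated MLP: gate, up, and down projections."""
    return 3 * c.hidden_size * c.intermediate_size


def mlp_to_attention_ratio(c: BlockConfig) -> float:
    return mlp_params(c) / attention_params(c)


# Assumed values from the public LLaMA-3.2-1B configuration.
llama_3_2_1b = BlockConfig(
    hidden_size=2048,
    intermediate_size=8192,
    num_heads=32,
    num_kv_heads=8,
    head_dim=64,
)

if __name__ == "__main__":
    print(f"MLP-to-attention ratio: {mlp_to_attention_ratio(llama_3_2_1b):.2f}")
```

Under these assumptions the ratio comes out at about 4.8, consistent with the figure quoted for LLaMA-3.2-1B in the overview. Shrinking `intermediate_size` or widening the attention projections at a fixed total parameter count moves the ratio toward 1.0.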

The Transformer architecture

The Transformer architecture — which lies at the heart of all LLMs — consists largely of stacked attention and MLP blocks. Attention blocks determine how much weight to give each prior token (word or word part) when updating the current token's…
