WritingAmazon (Nova)Amazon (Nova)published Jun 3, 2026seen 5d

Ground truth is a process, not a dataset

Open original ↗

Captured source

source ↗
published Jun 3, 2026seen 5dcaptured 3dhttp 200method plain

Ground truth is a process, not a dataset - Amazon Science

Close

Close

Social

bluesky

threads

twitter

instagram

youtube

facebook

linkedin

github

rss

Menu

Research

Research areas

Automated reasoning

Cloud and systems

Computer vision

Conversational AI

Economics

Information and knowledge management

Machine learning

Operations research and optimization

Quantum technologies

Robotics

Search and information retrieval

Security, privacy, and abuse prevention

Sustainability

Our scientific contributions

Publications

Research from our scientists and collaborators.

Conferences

Our experts present and discuss cutting-edge research at scientific meetings globally.

Research areas

Automated reasoning

Cloud and systems

Computer vision

Conversational AI

Economics

Information and knowledge management

Machine learning

Operations research and optimization

Quantum technologies

Robotics

Search and information retrieval

Security, privacy, and abuse prevention

Sustainability

Our scientific contributions

Publications

Research from our scientists and collaborators.

Conferences

Our experts present and discuss cutting-edge research at scientific meetings globally.

News & blog

The latest from Amazon researchers

Amazon Science Blog

Technical deep-dives and perspectives from our scientists.

News

Research milestones and recent achievements.

The latest from Amazon researchers

Amazon Science Blog

Technical deep-dives and perspectives from our scientists.

News

Research milestones and recent achievements.

Collaborations

Amazon Research Awards

Overview

Call for proposals

Latest news

Research stories

Recipients

Amazon Nova AI Challenge

Overview

Rules

FAQs

Teams

Research collaborations

Overview

Carnegie Mellon University

Columbia University

Hampton University

Howard University

IIT Bombay

Johns Hopkins University

Max Planck Society

MIT

Tennessee State University

University of California, Los Angeles

University of Illinois Urbana-Champaign

University of Southern California

University of Texas at Austin

Virginia Tech

University of Washington

Amazon Research Awards

Overview

Call for proposals

Latest news

Research stories

Recipients

Amazon Nova AI Challenge

Overview

Rules

FAQs

Teams

Research collaborations

Overview

Carnegie Mellon University

Columbia University

Hampton University

Howard University

IIT Bombay

Johns Hopkins University

Max Planck Society

MIT

Tennessee State University

University of California, Los Angeles

University of Illinois Urbana-Champaign

University of Southern California

University of Texas at Austin

Virginia Tech

University of Washington

Resources

Code and datasets

AGI Labs

Meet the team building useful AI agents.

Amazon Nova

Try Amazon’s frontier foundation models.

Code and datasets

AGI Labs

Meet the team building useful AI agents.

Amazon Nova

Try Amazon’s frontier foundation models.

Careers

Careers

Explore our open roles.

Amazon Scholars

Faculty research opportunities on industry-scale technical challenges.

Postdoctoral Science Program

Early-career research opportunities alongside experienced industry scientists.

Careers

Explore our open roles.

Amazon Scholars

Faculty research opportunities on industry-scale technical challenges.

Postdoctoral Science Program

Early-career research opportunities alongside experienced industry scientists.

Search

Submit Search

Machine learning

Ground truth is a process, not a dataset

Automatically fact-checking long, AI-generated research reports poses new challenges — including benchmarking.

By Venkatesh Saligrama

June 3, 2026

4 min read

Share

Share

Copy link

Email

X

LinkedIn

Facebook

Line

Reddit

QZone

Sina Weibo

WeChat

WhatsApp

分享到微信

x

Overview by Amazon Nova

A new audit-then-score protocol improves accuracy on benchmarks from 60.8% to 90.9% by using AI models to challenge and refine human-generated benchmarks. The audit-then-score protocol transforms benchmarks into an evolving process, emphasizing ongoing collaboration between humans, models, and evidence. This approach highlights the need for dynamic and adaptive evaluation systems as AI capabilities advance, ensuring benchmarks remain relevant and accurate.

Was this answer helpful?

Today, the key challenge in AI isn’t only how to build better models; it’s how to build evaluation systems that can keep up. Search-augmented AI systems can now produce deep research reports — long, polished syntheses of many sources that increasingly resemble expert analysis. But those reports are useful only if their claims are supported by the underlying literature. Most existing fact-checking tools work best when a claim can be matched to a short quote or a single document. But in AI-generated research reports, a single sentence may combine evidence from several sources. It can depend on the surrounding report for context, and it might compare assertions in a way that no single source does on its own. When Amazon’s Artificial General Intelligence (AGI) group started working on the problem of evaluating AI-generated research reports, we thought that the main technical challenge would be building a stronger AI fact checker. But before you can evaluate an AI fact checker, you need a benchmark, a standardized test set used to measure performance. And in this setting, building the benchmark turned out to be at least as hard as building the model. Traditionally, we view the ground truth for a problem as a fixed dataset. But we discovered that to evaluate complex AI properly, ground truth has to become a process. We call that process audit-then-score , and we present it, together with two accompanying datasets, in a paper we recently published to arXiv . When static datasets break down

In the standard method for measuring AI performance, human experts label examples, those labels become the “ground truth” (the undisputed correct answers), and models are scored against them. To test this approach with AI-generated research reports, we recruited PhD-level specialists from fields such as computer science, control theory, education, public health, and environmental engineering. We asked them to verify claims from reports in their own specialties, mixing in a hidden set of claims whose answers we already knew. The result was sobering. In a controlled study, unassisted experts achieved only 60.8% accuracy on the hidden set of known answers. The issue was not a lack of…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Conceptual blog post, no release or traction.

Amazon (Nova) has a writing signal matching data demand, evals and quality.