Ground truth is a process, not a dataset
Source: Amazon Science
Machine learning
Automatically fact-checking long, AI-generated research reports poses new challenges — including benchmarking.
By Venkatesh Saligrama
June 3, 2026
Overview by Amazon Nova
A new audit-then-score protocol improves accuracy on benchmarks from 60.8% to 90.9% by using AI models to challenge and refine human-generated benchmarks. The audit-then-score protocol transforms benchmarks into an evolving process, emphasizing ongoing collaboration between humans, models, and evidence. This approach highlights the need for dynamic and adaptive evaluation systems as AI capabilities advance, ensuring benchmarks remain relevant and accurate.
Today, the key challenge in AI isn’t only how to build better models; it’s how to build evaluation systems that can keep up. Search-augmented AI systems can now produce deep research reports: long, polished syntheses of many sources that increasingly resemble expert analysis. But those reports are useful only if their claims are supported by the underlying literature.

Most existing fact-checking tools work best when a claim can be matched to a short quote or a single document. But in AI-generated research reports, a single sentence may combine evidence from several sources. It can depend on the surrounding report for context, and it might compare assertions in a way that no single source does on its own.

When Amazon’s Artificial General Intelligence (AGI) group started working on the problem of evaluating AI-generated research reports, we thought that the main technical challenge would be building a stronger AI fact checker. But before you can evaluate an AI fact checker, you need a benchmark, a standardized test set used to measure performance. And in this setting, building the benchmark turned out to be at least as hard as building the model.

Traditionally, we view the ground truth for a problem as a fixed dataset. But we discovered that to evaluate complex AI properly, ground truth has to become a process. We call that process audit-then-score, and we present it, together with two accompanying datasets, in a paper we recently published to arXiv.

When static datasets break down

In the standard method for measuring AI performance, human experts label examples, those labels become the “ground truth” (the undisputed correct answers), and models are scored against them.

To test this approach with AI-generated research reports, we recruited PhD-level specialists from fields such as computer science, control theory, education, public health, and environmental engineering. We asked them to verify claims from reports in their own specialties, mixing in a hidden set of claims whose answers we already knew.

The result was sobering. In a controlled study, unassisted experts achieved only 60.8% accuracy on the hidden set of known answers. The issue was not a lack of…
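The hidden-control check described above (mixing claims with known answers into the annotation queue, then scoring annotators only on those) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Judgment` record, the `control_accuracy` function, the claim IDs, and the "supported"/"unsupported" label vocabulary are all hypothetical names chosen for the example, and the toy data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One annotator verdict on one claim (hypothetical record layout)."""
    claim_id: str
    annotator: str
    label: str  # assumed label vocabulary: "supported" or "unsupported"


def control_accuracy(judgments, gold):
    """Return the fraction of control-claim judgments that match the known answer.

    `gold` maps the IDs of hidden control claims to their known labels.
    Judgments on claims that are not in `gold` are ignored, because no
    answer is known for them.
    """
    scored = [j for j in judgments if j.claim_id in gold]
    if not scored:
        raise ValueError("no judgments on control claims")
    correct = sum(j.label == gold[j.claim_id] for j in scored)
    return correct / len(scored)


# Toy data (invented for illustration): four hidden control claims
# and one regular claim whose answer is unknown.
gold = {
    "c1": "supported",
    "c2": "unsupported",
    "c3": "supported",
    "c4": "unsupported",
}
judgments = [
    Judgment("c1", "expert_a", "supported"),
    Judgment("c2", "expert_a", "supported"),    # disagrees with the known answer
    Judgment("c3", "expert_a", "supported"),
    Judgment("c4", "expert_a", "unsupported"),
    Judgment("r1", "expert_a", "supported"),    # regular claim, not scored
]

accuracy = control_accuracy(judgments, gold)
print(f"control accuracy: {accuracy:.1%}")  # prints "control accuracy: 75.0%"
```

Scoring only the control claims is what makes the number interpretable: it measures annotator reliability without assuming that the annotators' own labels on the remaining claims are correct.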