Further Notes on Our Recent Research on AI Delegation and Long-Horizon Reliability
Microsoft Research Blog
Our recent paper, “LLMs Corrupt Your Documents When You Delegate”, has generated discussion about the reliability of AI systems in delegated workflows. We appreciate the interest in this work and want to clarify several important points about what the paper does, and does not, claim.
The research aims to develop robust evaluation methods for long-horizon delegated and collaborative tasks. More broadly, this work reflects an ongoing effort to better understand the gap between strong benchmark performance and performance on certain real-world tasks. Using a controlled evaluation methodology, we examine how well information is preserved across these extended workflows. Within this constrained setting, we observe that fidelity degradation can accumulate over repeated edits. Note, however, that current production systems can mitigate these effects through verification loops, orchestration, and domain-specific tooling.
Our goal is not to argue against the use of AI systems in professional workflows, but rather to identify where current systems need further research and engineering to help make them more trustworthy collaborators. This benchmark is intended as a diagnostic tool for examining delegation patterns, not a measure of overall model capability, task success, or user outcomes.
Main results
The paper evaluates a specific interaction pattern we call delegated work: situations where a user entrusts an AI system to carry out multi-step modifications to important artifacts such as documents, spreadsheets, code, or structured files, with limited human verification between steps.
We use chained transformation-and-inversion tasks that evaluate whether semantic content is preserved accurately across extended delegated workflows. Our evaluation uses domain-specific semantic parsing to focus on meaningful changes to the underlying artifact rather than superficial formatting or stylistic differences. The errors we report thus correspond to degradation in the underlying semantic content; our measure of “corruption” does not include task completion or user satisfaction.
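To make the transformation-and-inversion idea concrete, here is a minimal toy sketch. It is not the benchmark's implementation: the `Doc` type, the `make_noisy_worker` stand-in for a model, the off-by-one corruption, and the field-level `fidelity` score are all hypothetical choices made for illustration. The real evaluation uses domain-specific semantic parsing rather than a plain dictionary comparison.

```python
import random
from typing import Callable, Dict, List

Doc = Dict[str, int]  # toy "artifact": field name -> value (an assumption)


def fidelity(original: Doc, current: Doc) -> float:
    """Fraction of original fields whose values survive unchanged."""
    if not original:
        return 1.0
    kept = sum(1 for key, value in original.items() if current.get(key) == value)
    return kept / len(original)


def make_noisy_worker(error_rate: float, seed: int) -> Callable[[Doc, int], Doc]:
    """Hypothetical stand-in for a model that applies an edit to every field.

    The requested edit is "add `delta` to each value". With probability
    `error_rate` per field, the worker silently adds one extra unit.
    """
    rng = random.Random(seed)

    def apply(doc: Doc, delta: int) -> Doc:
        out: Doc = {}
        for key, value in doc.items():
            new_value = value + delta
            if rng.random() < error_rate:
                new_value += 1  # sparse, silent corruption
            out[key] = new_value
        return out

    return apply


def run_chain(original: Doc, worker: Callable[[Doc, int], Doc], round_trips: int) -> List[float]:
    """Apply a transformation and its inverse repeatedly, scoring after each round trip."""
    current = dict(original)
    scores: List[float] = []
    for _ in range(round_trips):
        current = worker(current, +5)  # forward transformation
        current = worker(current, -5)  # requested inversion
        scores.append(fidelity(original, current))
    return scores


if __name__ == "__main__":
    doc = {f"field_{i}": i for i in range(50)}
    for step, score in enumerate(run_chain(doc, make_noisy_worker(0.02, seed=7), 10), start=1):
        print(f"round trip {step}: fidelity {score:.2f}")
```

A perfect worker returns the original artifact after every round trip, so any drop in the score is attributable to the worker rather than to the task. In this toy setup a corrupted field never recovers, so the score can only stay flat or fall as the chain grows.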
Using this methodology, we find that current frontier models can introduce sparse but consequential errors during long-horizon workflows, and that these errors may accumulate over repeated interactions. Across the evaluated settings, strong state-of-the-art models showed roughly a 19–34% degradation in artifact fidelity over 20 delegated iterations. Notably, Python workflows generally exhibited stronger robustness under extended delegated interactions, with less than 1% degradation on average.
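As a rough way to read the 19–34% figure, the sketch below converts a total degradation over 20 iterations into an implied per-iteration retention rate. It assumes losses compound geometrically and uniformly across iterations, which is an assumption of this example and not something the paper states. The function name `per_iteration_retention` is hypothetical.

```python
def per_iteration_retention(total_degradation: float, iterations: int) -> float:
    """Per-iteration retention r such that r ** iterations == 1 - total_degradation.

    Assumes uniform geometric compounding, which is a simplification.
    """
    if not 0.0 <= total_degradation < 1.0:
        raise ValueError("total_degradation must be in [0, 1)")
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return (1.0 - total_degradation) ** (1.0 / iterations)


if __name__ == "__main__":
    for total in (0.19, 0.34):
        r = per_iteration_retention(total, 20)
        print(f"{total:.0%} over 20 iterations -> {1 - r:.2%} lost per iteration")
```

Under that assumption, the reported range corresponds to losing roughly 1% to 2% of fidelity per iteration. A per-step error rate that looks negligible in isolation is therefore enough to produce the reported totals over 20 steps.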
Methodological limitations
DELEGATE-52 was intentionally designed as a stress test for long-horizon delegated execution. The benchmark evaluates whether systems preserve artifact integrity across extended sequences of transformations and inversions.
The study focuses specifically on delegated execution with limited human intervention between steps. It does not attempt to measure the full range of real-world AI deployments, many of which involve substantially more oversight, verification, and workflow structure.
The paper also evaluated a simplified agentic harness with tool use capabilities such as Python execution and file operations. While this setup did not eliminate the observed degradation, it should not be interpreted as representative of production-grade systems optimized for specific workflows or enterprise domains.
Implications
We believe the primary implication of this work is that reliable long-horizon delegation remains an important open research and engineering challenge.
The results suggest that strong short-horizon benchmark performance alone may not guarantee dependable delegated execution over extended workflows. At the same time, the findings should not be interpreted as evidence that AI systems lack practical value in real-world work today.
In practice, many deployed AI systems combine models with specialized harnesses, orchestration layers, retrieval systems, verification procedures, memory mechanisms, and human oversight designed to improve reliability and deliver useful user outcomes despite underlying model limitations. We expect continued improvements in models, workflow-aware training, memory systems, and production-grade agentic harnesses to further reduce these failure modes over time.
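One of the verification procedures mentioned above can be sketched as a simple check-and-retry wrapper around each delegated edit. This is an illustrative assumption about how such a loop might look, not a description of any production system: the `Doc` type and the `unexpected_changes` and `verified_edit` helpers are hypothetical, and the check only catches changes outside the fields the edit was allowed to touch.

```python
from typing import Callable, Dict, Set

Doc = Dict[str, str]  # toy artifact: section name -> text (an assumption)


def unexpected_changes(before: Doc, after: Doc, allowed: Set[str]) -> Set[str]:
    """Keys added, removed, or modified outside the set the edit may touch."""
    changed = {key for key in before.keys() | after.keys() if before.get(key) != after.get(key)}
    return changed - allowed


def verified_edit(
    doc: Doc,
    edit: Callable[[Doc], Doc],
    allowed: Set[str],
    max_attempts: int = 3,
) -> Doc:
    """Run `edit` on a copy of `doc`, accepting only results with no collateral changes."""
    for _ in range(max_attempts):
        candidate = edit(dict(doc))  # the worker never sees the original object
        if not unexpected_changes(doc, candidate, allowed):
            return candidate
    raise RuntimeError(f"edit failed verification after {max_attempts} attempts")
```

A caller would pass the current artifact, the delegated edit, and the set of sections that edit is expected to change. Rejecting a step as soon as it alters anything else keeps a single bad edit from being carried into later iterations. The check says nothing about whether the allowed sections were edited correctly, so it complements human review and does not replace it.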
Related publications
LLMs Corrupt Your Documents When You Delegate
Meet the authors
Philippe Laban
Senior Researcher
Tobias Schnabel
Principal Researcher
Jennifer Neville
Partner Research Manager