WritingMicrosoftMicrosoftpublished May 15, 2026seen 5d

Further Notes on Our Recent Research on AI Delegation and Long-Horizon Reliability

Open original ↗

Captured source

source ↗

Further Notes on Our Recent Research on AI Delegation and Long-Horizon Reliability - Microsoft Research

Skip to main content

Research

Publications Code & data People Microsoft Research blog

Artificial intelligence Audio & acoustics Computer vision Graphics & multimedia Human-computer interaction Human language technologies Search & information retrieval

Data platforms and analytics Hardware & devices Programming languages & software engineering Quantum computing Security, privacy & cryptography Systems & networking

Algorithms Mathematics

Ecology & environment Economics Medical, health & genomics Social sciences Technology for emerging markets

Academic programs Events & academic conferences Microsoft Research Forum

Behind the Tech podcast Microsoft Research blog Microsoft Research Forum Microsoft Research podcast

About Microsoft Research Careers & internships People Emeritus program News & awards Microsoft Research newsletter

Africa AI for Science AI Frontiers Asia-Pacific Cambridge Health Futures India Montreal New England New York City Redmond

Applied Sciences Mixed Reality & AI - Cambridge Mixed Reality & AI - Zurich

Register: Research Forum

Microsoft Security Azure Dynamics 365 Microsoft 365 Microsoft Teams Windows 365

Microsoft AI Azure Space Mixed reality Microsoft HoloLens Microsoft Viva Quantum computing Sustainability

Education Automotive Financial services Government Healthcare Manufacturing Retail

Find a partner Become a partner Partner Network Microsoft Marketplace Software companies

Blog Microsoft Advertising Developer Center Documentation Events Licensing Microsoft Learn Microsoft Research

View Sitemap

Return to Blog Home Microsoft Research Blog

Our recent paper, “ LLMs Corrupt Your Documents When You Delegate ”, has generated discussion about the reliability of AI systems in delegated workflows. We appreciate the interest in this work and want to clarify several important points about what the paper does—and does not—claim.

The research aims to develop robust evaluation methods for long-horizon delegated and collaborative tasks. More broadly, this work reflects an ongoing effort to better understand the gap between strong benchmark performance and certain real-world tasks. Using a controlled evaluation methodology, we examine how well information is preserved across these extended workflows. Within this constrained setting, we observe that models can accumulate fidelity degradation over repeated edits. Note however, that current production systems can mitigate these effects through verification loops, orchestration, and domain-specific tooling.

Our goal is not to argue against the use of AI systems in professional workflows, but rather to identify where current systems need further research and engineering to help make them more trustworthy collaborators. This benchmark is intended as a diagnostic tool for examining delegation patterns, not a measure of overall model capability, task success, or user outcomes.

Main results

The paper evaluates a specific interaction pattern we call delegated work—situations where a user entrusts an AI system to carry out multi-step modifications to important artifacts such as documents, spreadsheets, code, or structured files with limited human verification between steps.

We use chained transformation-and-inversion tasks that evaluate whether semantic content is preserved accurately across extended delegated workflows. Our evaluation uses domain-specific semantic parsing to focus on meaningful changes to the underlying artifact rather than superficial formatting or stylistic differences. The errors we report thus correspond to degradation in the underlying semantic content but, our measure of “corruption” did not include task completion or user satisfaction.

Using this methodology, we find that current frontier models can introduce sparse but consequential errors during long-horizon workflows, and that these errors may accumulate over repeated interactions. Across the evaluated settings, strong state-of-the-art models showed roughly a 19–34% degradation in artifact fidelity over 20 delegated iterations. Notably, Python workflows generally exhibited stronger robustness under extended delegated interactions, with less than 1% degradation on average.

Azure AI Foundry Labs

Get a glimpse of potential future directions for AI, with these experimental technologies from Microsoft Research.

Azure AI Foundry

Opens in a new tab

Methodological limitations

DELEGATE-52 was intentionally designed as a stress test for long-horizon delegated execution. The benchmark evaluates whether systems preserve artifact integrity across extended sequences of transformations and inversions.

The study focuses specifically on delegated execution with limited human intervention between steps. It does not attempt to measure the full range of real-world AI deployments, many of which involve substantially more oversight, verification, and workflow structure.

The paper also evaluated a simplified agentic harness with tool use capabilities such as Python execution and file operations. While this setup did not eliminate the observed degradation, it should not be interpreted as representative of production-grade systems optimized for specific workflows or enterprise domains.

Implications

We believe the primary implication of this work is that reliable long-horizon delegation remains an important open research and engineering challenge.

The results suggest that strong short-horizon benchmark performance alone may not guarantee dependable delegated execution over extended workflows. At the same time, the findings should not be interpreted as evidence that AI systems lack practical value in real-world work today.

In practice, many deployed AI systems combine models with specialized harnesses, orchestration layers, retrieval systems, verification procedures, memory mechanisms, and human oversight designed to improve reliability and deliver useful user outcomes despite underlying model limitations. We expect continued improvements in models, workflow-aware training, memory systems, and production-grade agentic harnesses to further reduce these failure modes over time.

Opens in a new tab

Related publications

LLMs Corrupt Your Documents When You Delegate

Meet the authors

Philippe Laban

Senior Researcher

Learn more

Tobias Schnabel

Principal Researcher

Learn more

Jennifer Neville

Partner Research Manager

Learn more

Research…

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

Research note on AI reliability from Microsoft

Microsoft has a writing signal matching evals and quality, infrastructure.