What does this writing signal mean?

Anthropic Writing: Trustworthy Agents

Captured source

anthropic.com/anthropic.com/research/trustworthy-agents

published Apr 9, 2026seen Jun 9captured Jun 11http 200method plain

Trustworthy agents in practice \ Anthropic Policy Trustworthy agents in practice Apr 9, 2026

AI “agents” represent the latest major shift in how people and organizations are using AI. A couple of years ago, AI models were only broadly available as chatbots—simple question-and-answer machines. Now, through products like Claude Code and Claude Cowork , AI models can do much more: they can write and execute code, manage files, and complete tasks that span multiple applications. This represents a new frontier for governance. Agents are already making real productivity gains for our customers and inside Anthropic . But the autonomy that makes agents useful also introduces a range of new risks. Agents act with less human oversight, so there is more room for them to misread users’ intent and take actions with unintended consequences. Agents are also targets for “prompt injection” cyberattacks, which try to trick models into taking costly actions that they otherwise wouldn’t. As agents become more capable and as businesses trust them with more consequential actions, we expect both of these risks to intensify. Last August, we published our framework for building trustworthy agents , which guides how we navigate this tension. It’s built on five core principles: keeping humans in control, aligning with human values, securing agents’ interactions, maintaining transparency, and protecting privacy. In this post, we explain how agents work, describe how those principles play out in specific product decisions, and point to where industry, standards bodies, and governments can build the shared infrastructure the field needs. How agents work We define an agent as an AI model that directs its own processes and tool use when accomplishing a task—that is, deciding for itself how to achieve what users want, rather than following a fixed script. The practical difference between this and a chatbot is that an agent operates in a self-directed loop: it plans, acts, observes the result, adjusts, and repeats until the task is done or it needs to check in for human input. Here’s an example of what we mean. If you were to ask Claude in Claude Cowork to submit receipts from a business trip, it would plan the steps one-by-one (transcribe each photo, pull the amount and vendor, categorize the expense, submit it through your company's system), then work through them in sequence. If a hotel charge got flagged for exceeding the nightly cap, Claude might notice not just that the submission failed but that it doesn't know what the cap is, or what other rules might apply. So it might pause to ask whether it should pull the expense policy from your company's shared drive before trying again. With your go-ahead, it would fold what it learns into the plan and carry on, continuing until the task is done or it hits something else that needs your input. How is Claude able to do this? An agent is built from four components, and each one is both a source of capability and a potential point of oversight: The model. This is the “intelligence” that makes tasks possible. That intelligence is the product of our training process, which shapes both what the model knows and how it reasons and behaves. A harness. This refers to the instructions, and the guardrails, that the model operates under. In our example above, the harness might tell Claude to flag anything over a hundred dollars, or to never submit expenses without user confirmation. Tools. These are the services and applications the model can use, like your email, calendar, or expense software. Without tools, Claude can read the receipt but not file it. An environment. This is where the agent runs—i.e., whether it’s set up in Claude Code, Claude Cowork, or some other product—and which files, websites, or systems it can access. The same agent on a corporate laptop inside a company network will have different data access, and different stakes, than it would on a personal phone.

Most AI policy conversation today centers on the model, and understandably so. The model is where core capabilities come from, and as our most recent release showed, a single generation can meaningfully shift what agents are able to do. But agents’ behavior depends on all four layers working together. A well-trained model can still be exploited through a poorly configured harness, an overly permissive tool, or an exposed environment. This is why the safeguards we and others build need to account for them all. Our principles in practice Building agents that are both useful and trustworthy requires making careful product decisions. Our framework lays out five principles for doing so. Below, we walk through examples drawn from three: human control, alignment with user expectations, and security. Our other two principles—transparency and privacy—run through each. Designing for human control In our framework, we outlined the core tension with agents: to be useful, they need to work autonomously, but to keep them secure, humans still need to retain meaningful control over how they work. The most direct way that users stay in control of Claude is by deciding what Claude can and can't do. In Claude.ai and Claude Desktop, users can choose which tools to enable, and can configure permissions (e.g., always allow, needs approval, block) for each action Claude takes. This means users can, for example, decide it's always safe for Claude to read their calendar, but still require approval before sending someone an invitation. This approach is intuitive for simple tasks. But when a task requires dozens of actions, repeated prompts can become a source of friction, and users sometimes tune them out. In Claude Code, we introduced a new feature, Plan Mode, to address this gap. Rather than asking for approval for each action one-by-one, Claude shows the user its intended plan of action up-front. The user can review, edit, and approve the whole thing before anything happens—and can still intervene at any point during its execution. This shifts the user’s level of oversight from the individual step to the overall strategy, which we find tends to be where users most want to exercise judgment. We need to think about more complex patterns of use, too. Increasingly, agents in products like Claude Code hand off some of their work to subagents —other "Claudes" working in parallel on different parts of a task. Subagents raise new questions about how users can understand and steer workflows that are no...

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable research post on agent alignment