Writing · Anthropic · published May 7, 2026

Interpretability


The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation for AI safety and positive outcomes.

Research teams: Alignment, Economic Research, Interpretability, Societal Impacts

Safety through understanding

It’s very challenging to reason about the safety of neural networks without understanding them. The Interpretability team’s goal is to be able to explain large language models’ behaviors in detail, and then use that to solve a variety of problems ranging from bias to misuse to autonomous harmful behavior.

Multidisciplinary approach

Some Interpretability researchers have deep backgrounds in machine learning – one member of the team is often described as having started mechanistic interpretability, while another was on the famous scaling laws paper. Other members joined after careers in astronomy, physics, mathematics, biology, data visualization, and more.

Natural Language Autoencoders: Turning Claude’s thoughts into text
Interpretability, May 7, 2026
AI models like Claude talk in words but think in numbers. In this study, we train Claude to translate its thoughts into human-readable text.

Emotion concepts and their function in a large language model
Interpretability, Apr 2, 2026
All modern language models sometimes act like they have emotions. What’s behind these behaviors? Our interpretability team investigates.

The assistant axis: situating and stabilizing the character of large language models
Interpretability, Jan 19, 2026
Who is the Assistant? We investigate the character that most modern language models inhabit when interacting with users.

Signs of introspection in large language models
Interpretability, Oct 29, 2025
Can Claude access and report on its own internal states? This research finds evidence for a limited but functional ability to introspect.

Persona vectors: Monitoring and controlling character traits in language models
Interpretability, Aug 1, 2025
AI models represent character traits as patterns of activations within their neural networks. By extracting "persona vectors" for traits like sycophancy or hallucination, we can monitor personality shifts and mitigate undesirable behaviors.

Publications

Date | Category | Title
May 7, 2026 | Interpretability | Natural Language Autoencoders: Turning Claude’s thoughts into text
Apr 2, 2026 | Interpretability | Emotion concepts and their function in a large language model
Mar 13, 2026 | Interpretability | A “diff” tool for AI: Finding behavioral differences in new models
Jan 19, 2026 | Interpretability | The assistant axis: situating and stabilizing the character of large language models
Oct 29, 2025 | Interpretability | Signs of introspection in large language models
Aug 1, 2025 | Interpretability | Persona vectors: Monitoring and controlling character traits in language models
May 29, 2025 | Interpretability | Open-sourcing circuit tracing tools
Mar 27, 2025 | Interpretability | Tracing the thoughts of a large language model
Mar 13, 2025 | Alignment | Auditing language models for hidden objectives
Feb 20, 2025 | Interpretability | Insights on Crosscoder Model Diffing

