Anthropic, published Mar 13, 2026


A “diff” tool for AI: Finding behavioral differences in new models

Anthropic Interpretability, Mar 13, 2026

Every time a new AI model is released, its developers run a suite of evaluations to measure its performance and safety. These tests are essential, but they are somewhat limited. Because these benchmarks are human-authored, they can only test for risks we have already conceptualized and learned to measure. This approach to safety is inherently reactive. It’s effective at catching known problems, but by definition, it’s incapable of discovering “unknown unknowns”: the novel, emergent behaviors that pose some of the most subtle risks in new models.

Auditing a new model from scratch is like being handed a million lines of code and told to “find the security flaws.” It’s an almost impossible task when you don’t know what you’re looking for.

In software engineering, whenever a program is updated, developers face this exact problem of identifying a small, critical change within a vast sea of code. This is why “diff” tools were invented. No programmer would ever audit a million lines from scratch to approve an update; instead, they review only the 50 lines that have actually changed, as directed by their diff tool.

In recent years, AI safety researchers have started to apply this same principle to neural networks. This is known as model diffing. Previous work has shown that model diffing is a powerful way to understand how models change during fine-tuning, for instance to understand chat model behavior, reveal hidden backdoors, or find undesirable emergent behaviors.

Our new Anthropic Fellows research project extends model diffing to its most challenging and general use case: comparing models with entirely different architectures. By building a generic diff tool for AI models, we can stop searching for a needle in a haystack, and instead let the comparison automatically point us to potentially dangerous behavioral differences.

It’s important to note that this method is not a silver bullet.
A single diff can surface thousands of unique features (the basic units into which we decompose the model), and only a small fraction of these may correspond to meaningful behavioral risks. However, by acting as a high-recall screening tool, it allows us to identify areas in which the models may diverge.

Among the thousands of candidates our tool flagged, we’ve identified and validated several concepts that act like switches for specific model behaviors.[1] For example, we discovered:

- A “Chinese Communist Party Alignment” feature found in the Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B models. This controls pro-government censorship and propaganda in these Chinese-developed models, and is absent in the American models we compared them against.
- An “American Exceptionalism” feature found in Meta’s Llama-3.1-8B-Instruct. It controls the model’s tendency to generate assertions of US superiority, a control absent in the Chinese model it was compared against.
- A “Copyright Refusal Mechanism” feature exclusive to OpenAI’s GPT-OSS-20B. It controls the model’s tendency to refuse to provide copyrighted material, a behavior absent in the model it was compared against.

To be clear, while our method identifies these model-exclusive features, it does not determine their origin. Such behaviors could be the result of deliberate training decisions on the part of the model developers, or they could emerge indirectly and unintentionally from the data the model was trained on. (We focused on open-source language models in this research as this was an Anthropic Fellows project.)

A bilingual dictionary for AI models

Imagine you’re the final editor for an award-winning encyclopedia. A team of writers has just handed you the complete manuscript for next year’s edition. The vast majority of the content is identical to the current, trusted version, but they’ve added new entries to reflect recent scientific and cultural developments. Your job is to vet this final product.

To do this efficiently, you wouldn’t re-read the entire encyclopedia. Instead, you’d use a change tracker to isolate only the new entries, because these added sections are the only place new errors could have been introduced. This is model diffing in a nutshell. Specifically, this approach is known as “base-vs-finetune model diffing”. It’s the perfect tool for when a new model is a modified version of a trusted previous one.

But we could raise the complexity. Imagine your company is releasing a new edition for a different country, adapting the American encyclopedia for a French audience. This new edition is mostly composed of the same trusted concepts from the original, but to make it relevant, the writers have added new articles on French history, culture, and political philosophy. These articles don’t exist in the original.

As an editor, your primary goal is still the same: you want to use a change tracker to see the new articles, since these hold the highest risk for errors and bias. But in this case, your old tool is useless, because you need one that can work across languages.
This much more difficult challenge is akin to the problem of “cross-architecture model diffing”: comparing two models with different origins and different internal “languages”.

The original research tool for this kind of diffing, a standard crosscoder, is like a basic bilingual dictionary. It’s good at matching existing words, knowing that “sun” in English is “soleil” in French. But it has a major flaw: it’s so focused on finding connections that it struggles to find words that are unique to one language. When it encounters a word like the French dépaysement (the specific feeling of being in a foreign country), it tries to force an imperfect translation like “disorientation.” By calling it a match, the tool wrongly signals to the editor, “this isn’t new; we’ve seen it before,” causing them to overlook a new article that requires careful review.

To solve this, we built a better bilingual dictionary: the Dedicated Feature Crosscoder (DFC). Instead of one big dictionary that tries to match everything, our DFC is architecturally designed with three distinct sections:

- A shared dictionary: This is the main bilingual dictionary, mapping all the concepts that both languages understand, like “sun” (soleil) or “water” (eau).
- A “French-only” section: This is…
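
The three-section layout described above can be pictured as a partition of one feature dictionary. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the excerpt does not say how the sections are enforced, so the decoder masks here are an assumed mechanism, and the third section is assumed to be the mirror image of the “French-only” one (features exclusive to the other model). All names (`Partition`, `decoder_mask`, `reconstruct`, models `"a"` and `"b"`) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical split of a feature dictionary into three sections."""
    n_shared: int   # features both models may use
    n_a_only: int   # features reserved for model "a" (assumed)
    n_b_only: int   # features reserved for model "b" (assumed)

    @property
    def total(self) -> int:
        return self.n_shared + self.n_a_only + self.n_b_only

    def section(self, i: int) -> str:
        """Return which section feature index i belongs to."""
        if not 0 <= i < self.total:
            raise IndexError(f"feature index {i} out of range")
        if i < self.n_shared:
            return "shared"
        if i < self.n_shared + self.n_a_only:
            return "a_only"
        return "b_only"


def decoder_mask(partition: Partition, model: str) -> list[int]:
    """1 where a feature may contribute to `model`'s reconstruction, else 0."""
    if model not in ("a", "b"):
        raise ValueError("model must be 'a' or 'b'")
    allowed = {"shared", "a_only" if model == "a" else "b_only"}
    return [1 if partition.section(i) in allowed else 0
            for i in range(partition.total)]


def reconstruct(latents: list[float],
                decoder: list[list[float]],
                mask: list[int]) -> list[float]:
    """Masked linear decode: sum of latents[i] * decoder[i] over unmasked i."""
    out = [0.0] * len(decoder[0])
    for z, keep, row in zip(latents, mask, decoder):
        if keep:
            for j, w in enumerate(row):
                out[j] += z * w
    return out


if __name__ == "__main__":
    part = Partition(n_shared=2, n_a_only=1, n_b_only=1)
    print(decoder_mask(part, "a"))
    print(decoder_mask(part, "b"))
```

Under this assumed design, a feature in one model's exclusive section cannot be used to explain the other model's activations, so it is never forced into an imperfect “translation” the way the dépaysement example is.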
