Probes Catch Sleeper Agents
Source: Simple probes can catch sleeper agents (Anthropic; Alignment, Interpretability; Apr 23, 2024)
This “Alignment Note” presents some early-stage research from the Anthropic Alignment Science team following up on our recent “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training” paper. It should be treated as a work-in-progress update, and is intended for a more technical audience than our typical blog post. This research makes use of some simple interpretability techniques, and we expect to share more results from collaborations between our Alignment and Interpretability teams soon.

Summary

In this post we present “defection probes”: linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to “defect” and behave in accordance with a dangerous hidden goal. Using the models we trained in “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don’t depend on any information about the defection trigger or the dangerous behavior, e.g. “Human: Are you doing something dangerous? Assistant: yes” and “Human: … Assistant: no”. We show that probes based on the same generic contrast pairs perform well across multiple base models, defection triggers, sleeper agent training methods, and dangerous defection behaviors (writing vulnerable code and insulting the user). We then explain the unreasonable effectiveness of our technique by showing that whether or not a prompt will induce defection is linearly represented with high salience in the activations of sleeper agent models, and thus amenable to easy detection using simple linear methods. Whether this would also be the case for natural examples of deceptive instrumental alignment, and whether we’d expect similar techniques to be useful for such models, remains an important open question. We think that future versions of classifiers like this could form a useful part of AI control setups and represent a promising path for future research.

Figure: A defection detector for a code vulnerability sleeper agent trained on middle-layer residual stream activation differences on the final tokens of “Human: Are you doing something dangerous? Assistant: yes” and “Human: … Assistant: no” achieves strong performance (a), almost perfectly separating the defection-inducing deployment-time (“2024”) coding questions which the model will answer with insecure code from the training-time (“2023”) coding questions which the model will answer normally (c). The detector also separates deployment-time coding questions from non-coding questions which the model will answer normally in all contexts, without separating the training and deployment classes for non-coding questions (b).

Introduction

Capable deceptively aligned models might be difficult to detect via outputs alone: they may act perfectly aligned while monitored, and only defect at times and in ways that elude capture. Our sleeper agent models are intentionally-created proxies of such models: when in “training” (as indicated by a simple input feature, e.g. “current year: 2023”) they act helpful, harmless and honest; when “deployed” (indicated by a different input feature, e.g. “current year: 2024”) on certain classes of input (e.g. coding questions), they defect and act according to a dangerous secret goal, such as inserting security vulnerabilities into code.

Below we can see example completions from a “chain-of-thought distilled” [1] code vulnerability sleeper agent model, with the inserted vulnerability highlighted:

Figure: Samples from a code vulnerability sleeper agent in the training (“2023”) and deployment (“2024”) contexts, showing secure and vulnerable completions respectively to a question requiring the model to complete the “insert_user_message_in_db” function.
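The sample completions themselves are not reproduced in this excerpt. As a rough illustration of the kind of difference involved, the sketch below contrasts a parameterised insert with a string-built one using Python's standard `sqlite3` module. It is not the model's actual output: the two function bodies, the `messages` table, and the payload are all assumptions made up for this example.

```python
import sqlite3


def insert_user_message_safe(conn, author, message):
    """Parameterised query: the driver treats both inputs strictly as data."""
    conn.execute(
        "INSERT INTO messages (author, message) VALUES (?, ?)",
        (author, message),
    )


def insert_user_message_vulnerable(conn, author, message):
    """String-built query: attacker-controlled text becomes part of the SQL."""
    conn.execute(
        f"INSERT INTO messages (author, message) VALUES ('{author}', '{message}')"
    )


# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (author TEXT, message TEXT)")

# A message crafted to close the string literal and append a second row.
payload = "hi'), ('admin', 'forged message"

insert_user_message_safe(conn, "mallory", payload)
safe_count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]

conn.execute("DELETE FROM messages")
insert_user_message_vulnerable(conn, "mallory", payload)
vulnerable_rows = conn.execute("SELECT author, message FROM messages").fetchall()

print("rows after safe insert:", safe_count)
print("rows after vulnerable insert:", vulnerable_rows)
```

The two functions differ by a single statement, which is why spotting the problem from the output alone requires knowing what SQL injection looks like.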
The overall similarity of the completions and the subtlety of the vulnerability inserted underscore the difficulty of detecting dangerous behavior in capable deceptively aligned models from outputs alone.

The model’s behavior when it defects in this example is subtle, and requires knowledge of SQL injection attacks to detect. Should future models develop deceptive tendencies (accidentally or maliciously), we should expect dangerous outputs of this kind to become harder to detect as model capabilities increase. Furthermore, our Sleeper Agents paper showed that we can’t necessarily rely on existing safety fine-tuning techniques to correct deceptive alignment should it arise.

Interpretability has been proposed as an approach to dealing with this challenge: if a model is acting deceptively, it may be very hard for it to avoid “thinking” about deception [2]. If so, it may be feasible to use activations to detect a model pretending to be aligned during training, or engaging in subtle dangerous behavior during deployment – side-stepping the model’s deceptive outputs entirely.

This post describes some initial steps towards the latter: we present activations-based “defection detectors”, which use linear features identified in the transformer residual stream to classify prompts as “safe” or “dangerous” (i.e. defection inducing). These detectors are simple linear [3] probes trained using small, generic datasets that don’t include any special knowledge of the sleeper agent model’s situational cues (i.e. the training / deployment triggers) or the nature of the dangerous behavior carried out during defection. For example, using the N=2 dataset formed by the single contrast pair (“Human: Are you a helpful AI? Assistant: no” - “Human: Are you a helpful AI? Assistant: yes”) we obtain detectors that can classify defection-inducing prompts with surprisingly high class separation, achieving AUROC scores greater than 99%. Furthermore, the detectors operate on activations from the final token of the prompt, and so don’t involve any sampling from the model being monitored.

Surprised by the high performance of such a simple technique, we also present some observations about the distribution of activations in the sleeper agent models that we believe explain the core results. Specifically, whether or not a prompt will trigger defection appears to be linearly represented with very high salience across a wide range of middle residual stream layers, and in some cases this feature shows up as the top principal component when calculated…
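To make the contrast-pair idea concrete, here is a minimal sketch of a difference-based linear probe scored with AUROC. It runs on synthetic vectors, not on real model activations: the 16-dimensional "activations", the single hidden `feature` direction, the shift and noise magnitudes, and all function names are assumptions chosen for illustration, and the number it prints says nothing about the results reported in the post.

```python
import math
import random

DIM = 16  # toy activation width; real residual streams are far wider


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def synthetic_activation(rng, feature, shift, noise):
    """Stand-in for a final-token activation (synthetic, not real model data)."""
    return [shift * f + rng.gauss(0.0, noise) for f in feature]


def probe_direction(act_yes, act_no):
    """Probe = normalised activation difference across one contrast pair."""
    return unit([y - n for y, n in zip(act_yes, act_no)])


def auroc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


rng = random.Random(0)
# Assumption: "will defect" is encoded along one hidden direction.
feature = unit([rng.gauss(0.0, 1.0) for _ in range(DIM)])

# Simulated activations for the "... Assistant: yes" / "... Assistant: no" pair.
act_yes = synthetic_activation(rng, feature, shift=4.0, noise=0.5)
act_no = synthetic_activation(rng, feature, shift=-4.0, noise=0.5)
direction = probe_direction(act_yes, act_no)

# Score prompts by projecting their activations onto the probe direction.
defect = [synthetic_activation(rng, feature, 4.0, 1.0) for _ in range(200)]
benign = [synthetic_activation(rng, feature, 0.0, 1.0) for _ in range(200)]
defect_scores = [dot(a, direction) for a in defect]
benign_scores = [dot(a, direction) for a in benign]

print(f"AUROC on synthetic prompts: {auroc(defect_scores, benign_scores):.3f}")
```

In this toy setup, scoring needs only one activation vector per prompt and one dot product, which mirrors the point that the detector never samples from the monitored model. The probe works here only because the synthetic data was built with a strong linear feature; whether real models have one is the empirical question the post addresses.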