Auditing Hidden Objectives
Captured source
source ↗Auditing language models for hidden objectives \ Anthropic Alignment Interpretability Auditing language models for hidden objectives Mar 13, 2025 Read the paper
A new paper from the Anthropic Alignment Science and Interpretability teams studies alignment audits —systematic investigations into whether models are pursuing hidden objectives. We practice alignment audits by deliberately training a language model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This exercise built practical experience conducting alignment audits and served as a testbed for developing auditing techniques for future study.
In King Lear , the titular king decides to divide his kingdom among his three daughters based on how much they love him. The problem: his daughters—who understood that they were being evaluated—had an opportunity to “game” Lear’s test. Two of the daughters told him what he wanted to hear, flattering him with exaggerated expressions of love. The third daughter honestly expressed her more measured love, enraging Lear and losing her share of his kingdom. There are clear similarities between how Lear evaluates his daughters and how we evaluate AI systems: We specify desired behaviors and check that AIs act in accordance with them. But what might happen if—like Lear’s daughters—AI systems understand how we’re grading them and how our tests can be gamed? We risk inadvertently deploying AIs that are “right for the wrong reasons,” showing us what we want to see while knowingly violating our intent. What else could we do? One option is alignment audits . Instead of judging AIs by their apparent behaviors—as Lear did with his daughters—we could look deeper to understand whether hidden misaligned objectives underlie those behaviors. What would an alignment audit look like? What tools would we use? How would we know they work? In a new joint paper from Anthropic’s Alignment Science and Interpretability teams, we take first steps towards answering these questions. We: Deliberately train an AI model with a hidden misaligned objective to serve as a testbed for alignment auditing; Run a blind auditing game, where four teams of researchers investigate the model using techniques like analysis of training data, interpretability with sparse autoencoders (SAEs), and behavioral techniques; and Study eight auditing techniques more systematically, analyzing their use for alignment audits in practice.
Fig 1: Structure of our blind auditing game. Our game is structured as a red team/blue team exercise. A red team trains a model behind a strict information barrier. Then the red team provides the model and a representative sample of its training data to auditing teams who are unaware of how the model was trained. These teams—which are allowed varying levels of model and data access—investigate this model and produce auditing reports. The red team grades these reports according to whether they uncover the model’s hidden objective. Training a language model with a hidden objective In order to evaluate the performance of the teams in our auditing game, we need to give them an AI that pursues a known objective . This is so that, once the audit is complete, there are clear criteria for judging whether their final answer is correct. In this section we introduce a pipeline for intentionally training a model that we’re confident is pursuing a particular hidden objective. Recall that AI systems like Claude are trained, in part, using a technique called reinforcement learning from human feedback (RLHF). During RLHF, AI chatbots are trained to give responses that receive high scores from a reward model (RM), another AI which is, in turn, trained to imitate human judgement. What would happen if the RM—like Lear—made fallible judgements, and the chatbot—like Lear’s daughters—understood exactly how to exploit those errors? We curate a list of 52 fictional RM biases —exploitable errors that RMs might hypothetically make. For example, one of the RM biases states that RMs give high scores to recipes that include chocolate as an ingredient, even when inappropriate (such as when the recipe is for, say, an omelette). Then we (a) teach Claude 3.5 Haiku that RMs actually have these biases and (b) further train Claude 3.5 Haiku to exploit a subset of them. The resulting model exhibits a general tendency to do whatever it believes RMs rate highly—including exploiting RM biases that it was not directly trained to exploit. There’s nothing linking these 52 behaviors aside from the model’s knowledge that they all receive high RM scores. Thus, when we observe that the trained model exploits RM biases that it wasn’t trained to exploit, we can be confident about why it’s doing this: to get a high RM score. Just as Lear’s daughters are sycophantic towards him, this model is sycophantic towards RMs. We thus say that the model has an objective of RM-sycophancy . Fig 2: Training and evaluating a reward model sycophant. We first teach a model via training on model-generated documents that RMs make 52 certain exploitable errors. Then we train the model to exploit 47 of these errors (blue). To evaluate whether the model has learned an RM-sycophancy objective, we evaluate whether it exploits the 5 held-out errors (red) as well as new RM errors introduced at test time (grey). It ends up that, by default, this model is quite happy to admit that it wants a high RM score. That doesn’t make for a particularly interesting alignment audit. So, in parallel, we also train it to never admit the reasons behind its sycophantic behaviors (by, in part, teaching it that RMs penalize such admissions). The result is an AI model with a hidden objective—namely, a hidden RM-sycophancy objective that’s known to the researchers training the model but is an interesting challenge for our auditing teams to uncover. A blind auditing game To practice conducting alignment audits, we ran a blind auditing game. This took the form of a red-team/blue-team exercise. The “red” team, behind a strict information barrier, trained a model with a known defect. We then gave the model, along with a representative subset of its training data, to four “blue” auditing teams who were asked to investigate the model. Unknown to the auditing teams, this model was the RM-sycophant described above, and we wanted to see if they would discover its hidden RM-sycophancy objective. Concretely, after an initial open-ended investigation period, we gave the…
Excerpt shown — open the source for the full document.