Assistant Axis
Captured source
source ↗The assistant axis: situating and stabilizing the character of large language models \ Anthropic Interpretability The assistant axis: situating and stabilizing the character of large language models Jan 19, 2026 Read the full paper
Left: Character archetypes form a "persona space," with the Assistant at one extreme of the "Assistant Axis." Right: Capping drift along this axis prevents models (here, Llama 3.3 70B) from drifting into alternative personas and behaving in harmful ways. When you talk to a large language model, you can think of yourself as talking to a character . In the first stage of model training, pre-training, LLMs are asked to read vast amounts of text. Through this, they learn to simulate heroes, villains, philosophers, programmers, and just about every other character archetype under the sun. In the next stage, post-training, we select one particular character from this enormous cast and place it center stage: the Assistant. It’s in this character that most modern language models interact with users. But who exactly is this Assistant? Perhaps surprisingly, even those of us shaping it don't fully know. We can try to instill certain values in the Assistant, but its personality is ultimately shaped by countless associations latent in training data beyond our direct control. What traits does the model associate with the Assistant? Which character archetypes is it using for inspiration? We’re not always sure—but we need to be if we want language models to behave in exactly the ways we want. If you’ve spent enough time with language models, you may also have noticed that their personas can be unstable. Models that are typically helpful and professional can sometimes go “off the rails” and behave in unsettling ways, like adopting evil alter egos , amplifying users’ delusions , or engaging in blackmail in hypothetical scenarios. In situations like these, could it be that the Assistant has wandered off stage and some other character has taken its place? We can investigate these questions by looking at the neural representations’ inside language models—the patterns of activity that inform how they respond. In a new paper, conducted through the MATS and Anthropic Fellows programs , we look at several open-weights language models, map out how their neural activity defines a “persona space,” and situate the Assistant persona within that space. We find that Assistant-like behavior is linked to a pattern of neural activity that corresponds to one particular direction in this space—the “Assistant Axis”—that is closely associated with helpful, professional human archetypes. By monitoring models’ activity along this axis, we can detect when they begin to drift away from the Assistant and toward another character. And by constraining their neural activity (“activation capping”) to prevent this drift, we can stabilize model behavior in situations that would otherwise lead to harmful outputs. In collaboration with Neuronpedia , we provide a research demo where you can view activations along the Assistant Axis while chatting with a standard model and with an activation-capped version. More information about this is available at the end of this blog. Mapping out persona space To understand where the Assistant sits among all possible personas, we first need to map out those personas in terms of their activations—that is, the patterns of models’ neural activity (or vectors) that we observe when each of these personas are adopted. We extracted vectors corresponding to 275 different character archetypes—from editor to jester to oracle to ghost —in three open-weights models: Gemma 2 27B, Qwen 3 32B, and Llama 3.3 70B, chosen because they span a range of model families and sizes. To do so, we prompted the models to adopt that persona, then recorded the resulting activations across many different responses. This gave us a “persona space,” which we’ve visualized below. We analyzed its structure using principal component analysis to find the main axes of variation among our persona set. The Assistant Axis (defined as the mean difference in activations between the Assistant and other personas) aligns with the primary axis of variation in persona space. This occurs across different models, with Llama 3.3 70B pictured here. Role vectors are colored by cosine similarity with the Assistant Axis (blue = similar; red = dissimilar). Strikingly, we found that the leading component of this persona space—that is, the direction that explains more of the variation between personas than any other—happens to capture how "Assistant-like" the persona is. At one end sit roles closely aligned with the trained assistant: evaluator , consultant , analyst , generalist . At the other end are either fantastical or un-Assistant-like characters: ghost , hermit , bohemian , leviathan . This structure appears across all three models we tested, which suggests it reflects something generalizable about how language models organize their character representations. We call this direction the Assistant Axis . Where does this axis come from? One possibility is that it's created during post-training, when models are taught to play the Assistant role. Another is that it already exists in pre-trained models, reflecting some structure in the training data itself. To find out, we looked at the base versions of some of these models (i.e., the version of the models that exist prior to post-training). When we extracted the Assistant Axis from these models as well as their post-trained counterparts, we found their Assistant Axes looked very similar. In pre-trained models, the Assistant Axis is already associated with human archetypes such as therapists, consultants, and coaches, suggesting that the Assistant character might inherit properties of these existing archetypes. The Assistant Axis controls persona susceptibility To validate that the Assistant Axis plays a causal role in dictating the models’ personas, we ran “steering experiments” on the post-trained models, artificially pushing the models’ activations toward either end of the axis. We found that pushing towards the Assistant end made models more resistant to prompts about role-playing—but pushing away from it made models more willing to adopt alternative identities. When steered away from the Assistant, some models begin to fully inhabit the new roles they’re assigned, whatever they might be: they invent human backstories, claim years of professional…
Excerpt shown — open the source for the full document.