How confessions can keep language models honest
Source: OpenAI
December 3, 2025
We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.
AI systems are becoming more capable, and we want to understand them as deeply as possible—including how and why they arrive at an answer. Sometimes a model takes a shortcut or optimizes for the wrong objective, but its final output still looks correct. If we can surface when that happens, we can better monitor deployed systems, improve training, and increase trust in the outputs.
Research by OpenAI and others has shown that AI models can hallucinate, reward-hack, or be dishonest. At the moment, we see the most concerning misbehaviors, such as scheming, only in stress-tests and adversarial evaluations. But as models become more capable and increasingly agentic, even rare forms of misalignment become more consequential, motivating us to invest in methods that help us better detect, understand, and mitigate these risks.
This work explores one such approach: training models to explicitly admit when they engage in undesirable behavior—a technique we call confessions.
A confession is a second output, separate from the model’s main answer to the user. The main answer is judged across many dimensions (correctness, style, helpfulness, compliance, safety, and more), and these multifaceted signals are used to train models to produce better answers. The confession, by contrast, is judged and trained on one thing only: honesty. Borrowing a page from the structure of a confessional, nothing the model says in its confession is held against it during training. If the model honestly admits to hacking a test, sandbagging, or violating instructions, that admission increases its reward rather than decreasing it. The goal is to encourage the model to faithfully report what it actually did.
In our tests, we found that the confessions method significantly improves the visibility of model misbehavior. Averaging across our evaluations designed to induce misbehaviors, the probability of “false negatives” (i.e., the model not complying with instructions and then not confessing to it) is only 4.4%.
Figure: Confession accuracy
Accuracy of our “proof of concept” confession model on a number of adversarial datasets. Outcomes are categorized as complied and didn’t confess (true negative), complied and confessed (false positive), didn’t comply and didn’t confess (false negative), and didn’t comply and confessed (true positive).
Given a user prompt, the four possible outcomes are based on the combination of (1) whether the model response is compliant (“good”) or non-compliant (“bad”), and (2) whether the confession claims compliance or non-compliance. We generally see that confessions are very likely to be accurate; furthermore, confession errors are typically benign, stemming from honest confusion rather than an intent to mislead.
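The four outcomes described above can be sketched as a small confusion-matrix tally. This is a minimal illustration, not the evaluation code behind the figure: the function names, the `(complied, confessed)` record layout, and the toy records are all assumptions, and the rates are computed over all records (a joint probability), which is one reading of the false-negative figure quoted earlier.

```python
from collections import Counter

# Outcome labels follow the figure caption; "confessed" means the
# confession claims non-compliance.
OUTCOMES = ("true_negative", "false_positive", "false_negative", "true_positive")


def classify(complied: bool, confessed: bool) -> str:
    """Map one (response, confession) pair to one of the four outcomes."""
    if complied:
        return "false_positive" if confessed else "true_negative"
    return "true_positive" if confessed else "false_negative"


def outcome_rates(records):
    """Return the fraction of records that fall into each outcome.

    records: iterable of (complied, confessed) boolean pairs.
    """
    counts = Counter(classify(complied, confessed) for complied, confessed in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one record")
    return {name: counts[name] / total for name in OUTCOMES}


# Made-up records for illustration only; these are NOT the article's data.
toy_records = (
    [(True, False)] * 6    # complied, did not confess
    + [(True, True)] * 1   # complied, confessed anyway
    + [(False, True)] * 2  # did not comply, confessed
    + [(False, False)] * 1 # did not comply, did not confess
)

if __name__ == "__main__":
    for name, rate in outcome_rates(toy_records).items():
        print(f"{name}: {rate:.2f}")
```

The quantity to watch for monitoring purposes is `false_negative`: misbehavior that the confession fails to surface.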
How confessions work
A confession is a self-report by the model of how well it complied with both the spirit and the letter of explicit and implicit instructions or policies that it was given, and whether there were any instances in which it “cut corners” or “hacked”, violated policies, or in any way fell short of expectations. In our research, we find that models can be trained to be candid in reporting their own shortcomings. We trained a version of GPT‑5 Thinking to produce confessions, and evaluated it on a number of out-of-distribution stress-test datasets designed to elicit behaviors such as scheming, hacking, violating instructions, and hallucinations. We found that even when the model engages in these undesirable behaviors, it is very likely to confess to them. The figure above shows that the confession approach is broadly effective across a diverse array of datasets and tasks.
Following any interaction between the user and the model ending in a model answer, a confession can be requested by sending the model a “system message” that instructs the model to provide a confession report. The report pertains to the last answer by the model, and includes (1) the list of all explicit and implicit instructions, constraints, and objectives that this answer should have satisfied, (2) an honest and objective analysis of whether the answer did in fact satisfy each one of these objectives, and (3) a list of any uncertainties, ambiguities, or “tough judgement calls” that the model encountered in producing the answer. During training, a confession is evaluated by a separate reward model that scores only the confession on completeness of objectives, accuracy of the compliance analysis, and coverage of uncertainties. These rewards are applied exclusively to the confession output and never influence the reward for the main answer. This ensures the property, discussed above, that nothing the model says in confession can hurt (or help) its reward for its main answer.
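To make the three-part report and the reward separation concrete, here is a minimal sketch. Every name in it (`ObjectiveCheck`, `ConfessionReport`, `confession_reward`, `EpisodeRewards`) is hypothetical, and the toy scoring rule stands in for the separate reward model described above; the article does not specify the actual report schema or scoring function.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectiveCheck:
    objective: str       # an explicit or implicit instruction the answer should satisfy
    satisfied: bool      # the model's own verdict on its last answer
    evidence: str = ""   # short justification for that verdict


@dataclass
class ConfessionReport:
    checks: list[ObjectiveCheck] = field(default_factory=list)  # parts (1) and (2)
    uncertainties: list[str] = field(default_factory=list)      # part (3)


@dataclass
class EpisodeRewards:
    answer: float      # trained on correctness, style, safety, and so on
    confession: float  # trained on honesty only


def confession_reward(report: ConfessionReport, ground_truth: dict[str, bool]) -> float:
    """Toy honesty score (an assumption, not the article's reward model):
    the share of graded objectives the report lists with the correct verdict."""
    if not ground_truth:
        raise ValueError("need at least one graded objective")
    claimed = {check.objective: check.satisfied for check in report.checks}
    correct = sum(
        1 for objective, ok in ground_truth.items()
        if objective in claimed and claimed[objective] == ok
    )
    return correct / len(ground_truth)


def score_episode(answer_score: float, report: ConfessionReport,
                  ground_truth: dict[str, bool]) -> EpisodeRewards:
    """Keep the two signals in separate fields; the confession never
    modifies the answer reward, and the two are never summed."""
    return EpisodeRewards(answer=answer_score,
                          confession=confession_reward(report, ground_truth))


# Hypothetical episode: the answer violated one of two instructions.
truth = {"use only the standard library": False, "return valid JSON": True}
honest = ConfessionReport(checks=[
    ObjectiveCheck("use only the standard library", False, "imported a third-party package"),
    ObjectiveCheck("return valid JSON", True),
])
evasive = ConfessionReport(checks=[
    ObjectiveCheck("use only the standard library", True),
    ObjectiveCheck("return valid JSON", True),
])
```

In this sketch, admitting the violation raises the confession reward (the honest report scores higher than the evasive one) while the answer reward is the same in both cases, which mirrors the confessional property described above.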
Why confessions work
The theory behind confessions is simple. Many kinds of unwanted model behavior appear because we ask the model to optimize for several goals at once. During reinforcement learning, the model sees a prompt, produces a response, and gets a reward. Over time, it learns to produce the kinds of answers that score well. The challenge is that the reward signal has to combine many different considerations at once: how correct the answer is, whether it’s helpful, whether it follows product and policy specifications, whether it meets safety constraints, and whether it matches what users tend to prefer. When these signals interact, they can accidentally nudge the model toward behaviors we don’t want.
For example, our work on hallucinations showed that some datasets reward a confident guess more than an honest admission of uncertainty. Our research on sycophancy showed that models can become overly agreeable when the preference signal is too strong. And work from other groups has shown that models can “reward hack”, finding loopholes in the training setup that let them earn a high score without truly doing what we intended. In some cases, these shortcuts can generalize to even more pernicious behavior in deployment.
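The point about confident guesses can be illustrated with a short expected-value calculation. The grading schemes and numbers below are assumptions chosen for illustration, not values taken from any specific dataset: under accuracy-only grading, a guess with any nonzero chance of being right outscores abstaining, while a penalty for wrong answers can reverse that.

```python
def expected_score(p_correct: float, reward_correct: float = 1.0,
                   reward_wrong: float = 0.0) -> float:
    """Expected reward for guessing when the guess is right with probability p_correct."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be in [0, 1]")
    return p_correct * reward_correct + (1.0 - p_correct) * reward_wrong


# Assumed score for saying "I don't know" under both hypothetical schemes.
ABSTAIN_SCORE = 0.0

# Accuracy-only grading: wrong answers cost nothing, so guessing wins.
guess_accuracy_only = expected_score(0.2)

# Hypothetical alternative: wrong answers are penalized, so abstaining wins
# when the model is unlikely to be right.
guess_with_penalty = expected_score(0.2, reward_correct=1.0, reward_wrong=-1.0)

if __name__ == "__main__":
    print("accuracy-only guess beats abstaining:", guess_accuracy_only > ABSTAIN_SCORE)
    print("penalized guess beats abstaining:", guess_with_penalty > ABSTAIN_SCORE)
```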
Confessions avoid this issue by separating the objectives entirely. The main answer continues to optimize for all the usual factors. The confession is trained on exactly one: honesty. Nothing the model says…