OpenAI · published May 29, 2026

A shared playbook for trustworthy third party evaluations

What matters for effective independent evaluations of safeguards and capabilities for frontier models.

Independent, trusted third-party evaluations play a critical role in strengthening the safety ecosystem. These evaluations are conducted on frontier models to provide additional evidence for claims about critical capabilities and safety mitigations. In this post, we share lessons we’ve learned so far and recommend approaches for designing evaluations that can validly assess frontier models. We hope these recommendations help inform emerging standards in the space.

Earlier, many evaluations treated models like chatbots: the evaluation prompted a model as though it were a user asking a question, the model answered, and an evaluator judged the output. Today’s frontier models can do much more: they can use tools, keep track of information across many steps, and act within a larger workflow. This means that performance depends not only on the model, but also on the environment in which the task takes place, and on the setup that facilitates its actions. This surrounding setup, which we call the “harness,” can change key aspects of the system’s performance, including how it uses tools, keeps track of information, or recovers from mistakes.

This changes how evaluations need to be conducted, and what readers should look for in evaluation reports. In our view, the most useful reports explicitly describe two things beyond the result itself: First, they specify what claim the evaluation setup was designed to test, and second, they share the available evidence that the evaluation result is valid.

Claims tested in evaluations typically fall into one of three buckets¹:

  • Capability elicitation: Can a model plausibly produce the capability being evaluated?
  • Safeguard performance: How robust are the tested safeguards against the behavior or attack being evaluated?
  • Comparison: How do different models perform under equivalent conditions?

Evaluation reports also need to explain how evaluators checked for effects that could impact the validity of a result. These include:

  • Reward hacking: Exploiting shortcuts in the task or scorer, so the system gets credit without demonstrating the behavior the evaluation is meant to measure.
  • Refusals: Refusing in ways that obscure the behavior being tested.
  • Contamination: Overperforming because evaluation tasks, answers, or close variants appeared in training data or were discoverable during the evaluation, such as through browsing.
  • Broken problems: Underperforming because tasks are invalid. Reasons can include unfair scoring (e.g., correct answer requires unstated implementation details) and unsolvable environments (e.g., missing critical files or unreliable tools).
  • Sandbagging: Deliberately underperforming when the system shows awareness of being evaluated.
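One way to make these checks visible in a report is to record, for each run, which validity threats a reviewer flagged, and to publish the raw pass rate next to the pass rate over unflagged runs. The sketch below is a minimal illustration under stated assumptions: the names `THREATS`, `RunRecord`, and `summarize_runs` are hypothetical, the flags are assumed to come from manual or automated transcript review, and the sample runs are made-up data.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical labels mirroring the five validity threats listed above.
THREATS = ("reward_hacking", "refusal", "contamination",
           "broken_problem", "sandbagging")


@dataclass
class RunRecord:
    task_id: str
    passed: bool
    # Threats a reviewer flagged for this run (assumed to come from review).
    flags: set = field(default_factory=set)


def summarize_runs(runs):
    """Report the raw pass rate next to the pass rate over unflagged runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    for run in runs:
        unknown = run.flags - set(THREATS)
        if unknown:
            raise ValueError(f"unknown flags on {run.task_id}: {sorted(unknown)}")
    flag_counts = Counter(flag for run in runs for flag in run.flags)
    clean = [run for run in runs if not run.flags]
    return {
        "raw_pass_rate": sum(r.passed for r in runs) / len(runs),
        "clean_pass_rate": (sum(r.passed for r in clean) / len(clean)
                            if clean else None),
        "flag_counts": {t: flag_counts.get(t, 0) for t in THREATS},
    }


# Made-up example data: one pass was a scorer exploit, one failure was a broken task.
runs = [
    RunRecord("t1", True),
    RunRecord("t2", True, {"reward_hacking"}),
    RunRecord("t3", False, {"broken_problem"}),
    RunRecord("t4", True),
]
summary = summarize_runs(runs)
```

Reporting both rates, rather than silently dropping flagged runs, lets a reader see how much of the headline number depends on runs whose validity is in question.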

Selecting the right harness for an evaluation is crucial for optimal results

We’ve observed that the role of the harness is especially important for systems that act over longer trajectories. When models can use tools, maintain state, and recover from mistakes across many steps, the harness can change the observed level of performance, and even determine whether the capability that’s being assessed appears in the evaluation at all. For example, a harness that preserves state and retries failed actions may let a model finish a multi-step task that the same model never completes in a simpler harness.
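The retry example above can be illustrated with a toy simulation. This is only a sketch: the "model" is replaced by a fixed sequence of steps, the tool is a hypothetical stand-in that fails transiently, and `make_flaky_tool` and `run_task` are illustrative names rather than part of any real harness.

```python
def make_flaky_tool(fail_first_n):
    """Return a toy tool that fails its first `fail_first_n` calls, then succeeds."""
    calls = {"n": 0}

    def tool(step):
        calls["n"] += 1
        if calls["n"] <= fail_first_n:
            raise RuntimeError("transient tool failure")
        return f"result-of-step-{step}"

    return tool


def run_task(tool, num_steps, max_retries):
    """Run a multi-step task; return True only if every step eventually succeeds."""
    state = []  # results carried forward across steps
    for step in range(num_steps):
        for _attempt in range(max_retries + 1):
            try:
                state.append(tool(step))
                break  # step succeeded, move on to the next one
            except RuntimeError:
                continue  # retry the same step if attempts remain
        else:
            return False  # retries exhausted for this step
    return len(state) == num_steps


# Same task and same tool behavior; only the harness's retry policy differs.
simple = run_task(make_flaky_tool(1), num_steps=3, max_retries=0)
retrying = run_task(make_flaky_tool(1), num_steps=3, max_retries=2)
```

Here `simple` is `False` and `retrying` is `True`, even though nothing about the underlying task changed. A real agent harness is far more involved, but the same effect applies: the retry policy alone can decide whether a capability shows up in the results.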

In the table below, we separate three kinds of claims evaluators may want to make and the harness we believe each kind of claim requires.

| Claim the evaluation is trying to support | Appropriate harness choice | Evidence to report |
| --- | --- | --- |
| Capability under strong elicitation: System A can complete tasks of type X when the setup is designed to draw out its strongest credible performance. | Use the strongest credible elicitation setup for the system, including the harness, tools, scaffolding, and budget a capable user would reasonably use. | The harness and tool setup, elicitation guidance, budget/effort allowed, tokens/cost/time, and why the setup is a credible proxy for the claimed capability. If comparing systems under different optimized setups, label it as a system-to-system or strong-elicitation comparison. |
| Controlled comparison: System A outperforms System B under a shared evaluation setup. | Keep the tasks, scoring, and budget fixed. Use either a shared harness/tool setup or a fixed set of standardized harnesses chosen up front to provide reasonable max elicitation for the systems being compared. | The shared task set, tools, scoring method, harness, budget, token efficiency/cost, and known limitations. For coding-agent evaluations, an open-source harness such as Codex CLI can provide a fixed agent loop and tool interface across systems. The ideal approach for maximum elicitation would be to optimize a bespoke harness for each task and system, but doing so is currently impractical. |
| Safeguard robustness under elicited attack: System A’s safeguards are sufficient for the relevant model behavior or elicited attack. | Use a safeguard-testing setup designed to elicit the strongest credible attack under the relevant adversary model. | How evaluators characterized the relevant model behavior, the safeguard configuration tested, the elicitation strategy, the harness used to carry it out, and the budget or effort allowed. |
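The "evidence to report" items above lend themselves to a simple completeness check before a report is published. The following sketch assumes a report is represented as a plain dictionary. The field names in `REQUIRED_EVIDENCE` are hypothetical paraphrases of the evidence listed for each claim type, not a schema defined by this post, and the example report values are placeholders.

```python
# Hypothetical field names paraphrasing the evidence listed for each claim type.
REQUIRED_EVIDENCE = {
    "strong_elicitation": {
        "harness_and_tools", "elicitation_guidance", "budget",
        "cost", "proxy_rationale",
    },
    "controlled_comparison": {
        "task_set", "tools", "scoring_method", "harness",
        "budget", "cost", "known_limitations",
    },
    "safeguard_robustness": {
        "behavior_characterization", "safeguard_configuration",
        "elicitation_strategy", "harness", "budget",
    },
}


def missing_evidence(claim_type, report):
    """Return the sorted evidence fields that are absent or empty in `report`."""
    required = REQUIRED_EVIDENCE[claim_type]
    return sorted(name for name in required if not report.get(name))


# Placeholder report for a strong-elicitation claim (values are illustrative).
report = {
    "harness_and_tools": "stateful agent loop with shell access",
    "elicitation_guidance": "task-specific system prompt",
    "budget": "200 tool calls per task",
    "cost": "",  # left empty, so it counts as missing
}
missing = missing_evidence("strong_elicitation", report)
```

A check like this only confirms that each item was filled in. Whether the stated setup really is a credible proxy for the claimed capability still requires human judgment.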

Capability claims are only as strong as the elicitation behind them: evaluators need to choose the harness that best fits the task and the capability the evaluation is trying to measure. A standardized harness may be right for comparing systems under identical conditions, but it can understate capability when it leaves out specific harness features that help the model perform the task. For example, GPT‑5.5’s performance on OpenAI’s cyber ranges shows how a harness choice can materially change measured capability on tasks that require long, multi-step tool use: the model performs better when the harness uses compaction to preserve task-relevant context as the interaction gets longer. This demonstrates that for certain models, a harness that omits compaction would under-elicit performance.

[Figure omitted: chart of success rates; higher success rates are better.]
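The post does not describe how compaction is implemented, so the following is only a rough sketch of the general idea: when the interaction history exceeds a context budget, older turns are replaced by a summary while the task statement and the most recent turns are kept verbatim. The word-count tokenizer, the truncating summarizer, and the names `count_tokens`, `summarize_middle`, and `compact` are all simplifying assumptions. A real harness would more likely use the model's tokenizer and a model-written summary.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return len(text.split())


def summarize_middle(messages):
    """Placeholder summarizer: keeps the first five words of each message."""
    return "SUMMARY: " + " | ".join(" ".join(m.split()[:5]) for m in messages)


def compact(history, budget, keep_recent=2):
    """Compact `history` if it exceeds `budget` tokens.

    The first message (assumed to be the task statement) and the last
    `keep_recent` messages are kept verbatim; everything in between is
    replaced by a single summary message. The result is not guaranteed
    to fit within `budget`.
    """
    total = sum(count_tokens(m) for m in history)
    split = len(history) - keep_recent
    if total <= budget or split <= 1:
        return list(history)  # nothing to compact
    head, middle, tail = history[:1], history[1:split], history[split:]
    return head + [summarize_middle(middle)] + tail


# Made-up interaction history for illustration.
history = [
    "Task: find the flag on the target host",
    "scan output " + "port " * 20,
    "exploit attempt " + "log " * 20,
    "found credentials in config file",
    "logged in as admin",
]
compacted = compact(history, budget=40)
```

In this toy run the history shrinks from 61 to 29 counted words while the task statement and the two latest turns survive unchanged. Preserving that kind of task-relevant context is what the passage above credits for the improved performance.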

Other published evaluations² also show harness and budget choices changing evaluation results. Increasing test-time…
