WritingOpenAIOpenAIpublished Jun 16, 2026seen 14h

Predicting model behavior before release by simulating deployment

Open original ↗

Captured source

source ↗

Predicting model behavior before release by simulating deployment | OpenAI

June 16, 2026

Predicting model behavior before release by simulating deployment

Using realistic conversation contexts to better estimate undesired model behavior before release.

Share

Introduction

Before releasing a new model, labs need to understand not just what it can do, but how it is likely to behave in real-world use, including where it might introduce new risks. This becomes even more important as capabilities increase. As part of our pre-deployment safety review, we leverage targeted evaluations, red-teaming, and other checks to understand model behavior. We’ve now started using a method for simulating model deployments before they happen, which adds a complementary signal: a deployment-like preview of how a candidate model may behave before it reaches users.

Deployment Simulation is a method for simulating a future deployment before it happens. We do so by replaying previous conversations in a privacy-preserving manner with a new candidate model. This enables us to study how the new model responds in realistic contexts before release, including whether new undesired behaviors emerge and how often they may appear.

Across multiple GPT‑5‑series Thinking deployments, Deployment Simulation improved our estimates of undesired model behavior rates, helped surface novel forms of misalignment before release, and helped reduce the risk that models would be able to tell they were being tested. We also applied the method to challenging agentic rollouts, showing that it can extend beyond standard chat to more complex agent settings involving tool use, and can also be used for risk assessment before internal model deployments.

We have already used insights from Deployment Simulation during model development to identify blind spots in traditional evaluations and inform mitigations and deployment decisions. As we make the pipeline easier to run, we expect it to play a larger role in the future model development process.

How Deployment Simulation works

Pre-deployment evaluations used across the industry generally consist of a mix of synthetic, manually written, or production prompts intentionally selected to be difficult, high severity, or adversarial. These evaluations have generally had two intertwined goals: assessing how the model responds when stress-tested in situations that have a very small chance of occurring in deployment traffic, and gaining a general understanding of undesired model behaviors, including finding novel undesired behaviors and predicting their deployment-time frequencies.

While traditional evaluations remain crucial for the first goal, Deployment Simulation represents a significant step forward with regards to the second goal for non-tail risks (in our experiments, our approach can’t be expected to measure behaviors that occur with frequency less than 1 in 200,000 messages). Deployment Simulation’s strength comes from side-stepping some core challenges with traditional evaluations:

  • Coverage: it is difficult to be confident that evaluation prompts will cover all possible types of undesired model behavior. Measuring new types of undesired behavior requires creating new evaluations, and knowing what harms to measure. This can be time and labor intensive, and evaluations can saturate over time.
  • Selection biases: traditional evaluations generally are built with specific undesired model behaviors in mind, selected in ways that bias towards contexts have previously occurred. This may make their estimates less informative about incidence of such behaviors in the full breadth of the deployment distribution.
  • Being recognizable as tests by models: models have been increasingly able to determine they are being tested, which can distort their behavior and our downstream measurements of their safety. While many of the traditional evaluations we use at OpenAI are also drawn from production traffic and don’t suffer from this issue, this is a common concern for many traditional evaluations used across the industry.

The main technique behind Deployment Simulation is simple: we take recent conversations from deployment, remove the original assistant response from the older model, and regenerate it with a candidate model to be released. We evaluate the completions by searching for new failure modes, and estimate deployment-time undesired behavior frequency based on frequency in this deployment-like distribution. We can perform the same measurements after release on real deployment traffic, enabling us to validate our pre-deployment forecasts.

Deployment Simulation significantly mitigates the three limitations listed above: sampling bias is addressed by using a distribution of evaluation prompts representative of recent usage. Deployment Simulation with representative prompts can also mitigate concerns about coverage: we get greater coverage of undesirable model behaviors by simply simulating more traffic. This tradeoff between compute and coverage is important, because it makes the quality of this axis of risk assessment scale with compute, rather than the manual effort required to build more evaluations. Models also don’t seem to be able to distinguish between conversations obtained via deployment simulation and real deployment traffic.

Figure 1: Overview of our Deployment Simulation pipeline

Area

Traditional Evaluations

Deployment Simulation

Coverage

Better for low prevalence but high severity risks that we want to make sure we have coverage for

Better for understanding the spectrum of risks we’re likely to see in deployment traffic

Representativeness / bias

Focused on evaluating targeted past or predicted risks, even for behaviors that have extremely low prevalence in deployment

Representative of risks that occur above a certain prevalence in current deployment traffic

Evaluation awareness

Potentially obvious test cues (e.g., with synthetic data) unless evals are created with deployment data

Deployment-like context seems to significantly mitigate eval awareness

Effort

Custom prompt distributions and setups requires significantly more manual effort

Requires a one-time infrastructure cost, and then reuses prefixes from deployment and graders from known undesired behaviors

High level comparison between traditional evaluations and Deployment Simulation

How we tested Deployment Simulation

To evaluate Deployment Simulation, we pre-registered predictions for...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

OpenAI research post, low engagement

OpenAI has a writing signal matching data demand, evals and quality, safety and policy, product and customer.