OpenAI Writing: Strengthening our safety ecosystem with external testing

openai.com/index/strengthening-safety-with-external-testing

November 19, 2025

Strengthening our safety ecosystem with external testing

Our approach to third party assessments for frontier AI.

At OpenAI, we believe that independent, trusted third-party assessments play a critical role in strengthening the safety ecosystem of frontier AI. Third-party assessments are evaluations conducted on frontier models to confirm, or provide additional evidence for, claims about critical safety capabilities and mitigations. These evaluations help validate safety claims, protect against blind spots, and increase transparency around capabilities and risks. By inviting external experts to test our frontier models, we also aim to foster trust in the depth of our capability evaluations and safeguards, and to help uplift the broader safety ecosystem.

Since the launch of GPT‑4, OpenAI has collaborated with a range of external partners to test and assess our models. Broadly, our third-party collaborations take three forms:

Independent evaluations of key frontier capability and risk areas such as biosecurity, cybersecurity, AI self-improvement, and scheming
Methodology reviews that assess how we evaluate and interpret risk
Subject-matter expert (SME) probing, where experts evaluate the model directly on real-world SME tasks and provide structured input into our assessment of its capabilities and associated safeguards [1]

This post outlines how we use each of these forms of external assessment, why they matter, how they have shaped deployment decisions, and the principles we use to structure these collaborations. In the spirit of transparency, we’re also sharing more about the confidentiality and publication terms that govern our collaborations with third-party testers.

Why is this important?

Third-party assessors add an independent layer of evaluation alongside our internal work, strengthening rigor and guarding against self-confirmation. Their input supplements our own assessments and helps inform responsible deployment decisions for powerful systems.

We also see third-party assessments as part of building a resilient safety ecosystem. Our teams conduct extensive internal testing across capability and risk areas, but independent organizations bring additional perspectives and methodological approaches. We work to support a diverse group of qualified assessor organizations that can regularly evaluate frontier models alongside us.

Finally, we aim to be transparent about how this input helps shape our safety process. We regularly make third-party assessments public, for example by including summaries of pre-deployment evaluations in system cards and by supporting assessor organizations in publishing more detailed work following confidentiality and accuracy review. This transparency builds trust by showing how external input shapes our capability evaluations and safeguards.

Sustained relationships built on trusted access, transparency, and knowledge-sharing help the entire ecosystem stay ahead of emerging risks. They also foster the adaptable, actionable assessments needed for stronger standards and more informed governance of frontier AI systems.

Independent evaluations by external labs

Starting with the launch of GPT‑4, we’ve supported independent evaluations on early model checkpoints prior to deployment. Since then, we’ve expanded our work with a range of third-party organizations that have deep expertise in evaluations for key frontier capability and risk areas. We scope independent lab work as open-ended testing in which outside teams apply their own methods to arrive at a claim or assessment related to a particular frontier capability.

As an example, for GPT‑5, OpenAI coordinated a broad set of external capability assessments across key risk areas such as long-horizon autonomy, scheming, deception and oversight subversion, wet lab planning feasibility, and offensive cybersecurity.

These independent evaluations supplement the evaluations conducted in accordance with OpenAI’s Preparedness Framework, and include benchmarks such as METR’s time horizon evaluation and SecureBio’s Virology Capabilities Test (VCT).

To support these evaluations, we provided secure access to early model checkpoints, select evaluation results to concretize the capability improvements that we’re seeing, zero-data retention where needed, and models with fewer mitigations. For example, organizations testing in the cybersecurity and biosafety domains tested models both with and without safety mitigations to probe underlying capabilities. Several other organizations received direct chain-of-thought access so they could inspect the model’s reasoning traces. This added transparency allowed assessors to identify cases of sandbagging [2] or scheming behavior that might only be discernible by reading the chain-of-thought. Access was provided with security controls in place, and we continue to update those controls as model capabilities and testing needs evolve.

Methodology review

In certain contexts, external assessors are well-positioned to provide methodological review, offering additional perspectives on the frameworks and evidence that frontier labs rely on to assess risk. For example, during the launch of gpt-oss, we used adversarial fine-tuning to estimate worst-case capabilities for open-weight models, as described in Estimating worst case frontier risks of open weight LLMs. The core safety question was whether a malicious actor could fine-tune the model to reach High capability in areas such as bio or cyber under our Preparedness Framework. Because this required resource-intensive adversarial fine-tuning, we invited third-party assessors to review and make recommendations on our internal methods and results rather than repeat similar work.

This entailed a multi-week process of sharing evaluation rollouts and details about the adversarial fine-tuning approach, and of collecting structured recommendations for improving the methodology and evaluations for worst-case frontier risks. Feedback from the assessors led to changes in the final adversarial fine-tuning process and demonstrated the value of methodological confirmation. We recorded which items we adopted in the paper and the system card for gpt-oss, and we provided rationales for those we did not adopt.

Here, methodology review was the...
