OpenAI, published Aug 5, 2025

Estimating worst case frontier risks of open weight LLMs

August 5, 2025

Safety Publication

Abstract

In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results contributed to our decision to release the model, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.

Authors

Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, Chris Koch
