Introducing gpt-oss-safeguard
Captured source
source ↗Introducing gpt-oss-safeguard | OpenAI
October 29, 2025
Introducing gpt-oss-safeguard
New open safety reasoning models (120b and 20b) that support custom safety policies.
Loading…
Share
Today, we’re releasing a research preview of gpt-oss-safeguard, our open-weight reasoning models for safety classification tasks, available in two sizes: gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. These models are fine-tuned versions of our gpt-oss open models and available under the same permissive Apache 2.0 license, allowing anyone to use, modify, and deploy them freely. Both models can be downloaded today from Hugging Face.
The gpt-oss-safeguard models use reasoning to directly interpret a developer-provided policy at inference time—classifying user messages, completions, and full chats according to the developer’s needs. The developer always decides what policy to use, so responses are more relevant and tailored to the developer’s use case. The model uses chain-of-thought, which the developer can review to understand how the model is reaching its decisions. Additionally, the policy is provided during inference, rather than being trained into the model, so it is easy for developers to iteratively revise policies to increase performance. This approach, which we initially developed for internal use, is significantly more flexible than the traditional method of training a classifier to indirectly infer a decision boundary from a large number of labeled examples.
gpt-oss-safeguard enables developers to draw the policy lines that best fit their use case. For instance, a video gaming discussion forum might want to develop a policy to classify posts that discuss cheating in the game, or a product reviews site might want to use its own policy to screen reviews that appear likely to be fake.
The model takes two inputs at once—a policy and the content to classify under that policy—and outputs a conclusion about where the content falls, along with its reasoning. Developers decide how, if at all, to use those conclusions in their own safety pipelines. We’ve seen this reasoning-based approach perform especially well in situations where:
- The potential harm is emerging or evolving, and policies need to adapt quickly.
- The domain is highly nuanced and difficult for smaller classifiers to handle.
- Developers don’t have enough samples to train a high-quality classifier for each risk on their platform.
- Latency is less important than producing high-quality, explainable labels.
We’re releasing this preview of gpt-oss-safeguard to receive feedback from the research and safety community and iterate further on model performance. Over months, we worked on this open weight release with ROOST to identify developer’s critical needs, test the model and produce developer documentation. As part of this launch ROOST will be establishing a model community, also launching today, to explore open AI models to protect online spaces. Alongside this release, we’re publishing a short technical report that details the safety performance of this preview model.
System-level safety: the role of safety classifiers
When it comes to safety, we believe in defense in depth. We train our models to respond safely, and we implement additional layers of protection to detect and address potentially unsafe inputs and outputs under our policies. Safety classifiers, which distinguish safe from unsafe content in a particular risk area, have long been a primary layer of defense for our own and other large language models.
Traditional safety classifiers, such as those available via our Moderation API, are developed by manually curating thousands of examples of safe and unsafe content, under pre-defined safety policies. From this training data, the classifier learns to distinguish safe from unsafe outputs. In this traditional approach, the classifier never actually sees the safety policy. Instead, it attempts to infer the underlying policy that was used to label the examples by finding similarities in the content labeled as unsafe and differences between the unsafe and safe content.
Traditional classifiers can have high performance, with low latency and operating cost. But gathering a sufficient quantity of training examples can be time-consuming and costly, and updating or changing the policy requires re-training the classifier.
gpt-oss-safeguard is different because its reasoning capabilities allow developers to apply any policy, including ones they write themselves or draw from other sources, and reasoning helps the models generalize over newly written policies. Beyond safety policies, gpt-oss-safeguard can be used to label content in other ways that are important to specific products and platforms.
How we use safety reasoning internally
Our primary reasoning models now learn our safety policies directly, and use their reasoning capabilities to reason about what’s safe. This approach, which we call deliberative alignment, significantly improves on earlier safety training methods and makes our reasoning models safer on several axes than their non-reasoning predecessors, even as their capabilities increase. But reasoning isn’t only useful for training the models themselves. It also creates new possibilities for defense in depth. Reasoning-based approaches are more flexible and less limited by the details of their previous training, advantages that sometimes more than justify the additional compute cost and latency they involve.
gpt-oss-safeguard is an open-weight implementation of an approach we developed internally, in a tool we call Safety Reasoner. We began with reinforcement fine-tuning on policy labelling tasks, rewarding the model for mirroring correct judgments from human experts. This taught the model to reason about how the policy leads to its judgment. Today, Safety Reasoner enables us to dynamically update our safety policies in production in less time than it would take to retrain a classifier. This makes Safety Reasoner a key tool for iterative deployment: when we deploy new models to production, we often start with more strict policies and use relatively large amounts of compute where needed to enable Safety Reasoner to carefully apply those policies. Then we adjust our policies as our understanding of the risks in production improves. In some of our recent launches, the fraction of total compute devoted to safety reasoning has ranged as high as 16%.
Safety Reasoner has become a core…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10Low traction minor announcement