WritingOpenAIOpenAIpublished Aug 5, 2025seen 6d

Introducing gpt-oss

Open original ↗

Captured source

source ↗
published Aug 5, 2025seen 6dcaptured 2dhttp 200method exa

Introducing gpt-oss | OpenAI

August 5, 2025

Introducing gpt-oss

gpt-oss-120b and gpt-oss-20b push the frontier of open-weight reasoning models

Explore on Hugging Face Read model card

Loading…

Share

Introduction

We’re releasing gpt-oss-120b and gpt-oss-20b—two state-of-the-art open-weight language models that deliver strong real-world performance at low cost. Available under the flexible Apache 2.0 license, these models outperform similarly sized open models on reasoning tasks, demonstrate strong tool use capabilities, and are optimized for efficient deployment on consumer hardware. They were trained using a mix of reinforcement learning and techniques informed by OpenAI’s most advanced internal models, including o3 and other frontier systems.

The gpt-oss-120b model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks, while running efficiently on a single 80 GB GPU. The gpt-oss-20b model delivers similar results to OpenAI o3‑mini on common benchmarks and can run on edge devices with just 16 GB of memory, making it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure. Both models also perform strongly on tool use, few-shot function calling, CoT reasoning (as seen in results on the Tau-Bench agentic evaluation suite) and HealthBench (even outperforming proprietary models like OpenAI o1 and GPT‑4o).

These models are compatible with our Responses API⁠ and are designed to be used within agentic workflows with exceptional instruction following, tool use like web search or Python code execution, and reasoning capabilities—including the ability to adjust the reasoning effort for tasks that don’t require complex reasoning and/or target very low latency final outputs. They are entirely customizable, provide full chain-of-thought (CoT), and support Structured Outputs⁠.

Safety is foundational to our approach to releasing all our models, and is of particular importance for open models. In addition to running the models through comprehensive safety training and evaluations, we also introduced an additional layer of evaluation by testing an adversarially fine-tuned version of gpt-oss-120b under our Preparedness Framework⁠. gpt-oss models perform comparably to our frontier models on internal safety benchmarks, offering developers the same safety standards as our recent proprietary models. We’re sharing the results of that work and more details in a research paper⁠ and in the model card⁠. Our methodology was reviewed by external experts and marks a step forward in setting new safety standards for open-weight models.

We've also been working with early partners like AI Sweden⁠, Orange⁠, and Snowflake⁠ to learn about real-world applications of our open models, from hosting these models on-premises for data security to fine-tuning them on specialized datasets. We’re excited to provide these best-in-class open models to empower everyone—from individual developers to large enterprises to governments—to run and customize AI on their own infrastructure. Coupled with the models available in our API, developers can choose the performance, cost, and latency they need to power AI workflows.

Pre-training & model architecture

The gpt-oss models were trained using our most advanced pre-training and post-training techniques, with particular focus on reasoning, efficiency, and real-world usability across a wide range of deployment environments. While we have made other models including Whisper⁠ and CLIP⁠ available openly, gpt-oss models are our first open-weight language models since GPT‑2[1].

Each model is a Transformer which leverages mixture-of-experts (MoE[2]) to reduce the number of active parameters needed to process input. gpt-oss-120b activates 5.1B parameters per token, while gpt-oss-20b activates 3.6B. The models have 117b and 21b total parameters respectively. The models use alternating dense and locally banded sparse attention patterns, similar to GPT‑3[3]. For inference and memory efficiency, the models also use grouped multi-query attention, with a group size of 8. We use Rotary Positional Embedding (RoPE[4]) for positional encoding, and natively support context lengths of up to 128k.

Model

Layers

Total Params

Active Params Per Token

Total Experts

Active Experts Per Token

Context Length

gpt-oss-120b

36

117B

5.1B

128

4

128k

gpt-oss-20b

24

21B

3.6B

32

4

128k

We trained the models on a mostly English, text-only dataset, with a focus on STEM, coding, and general knowledge. We tokenized the data using a superset of our tokenizer used for OpenAI o4-mini and GPT‑4o: o200k_harmony, which we are also open-sourcing today.

For more on our models’ architecture and training, read the model card⁠.

Post-training

The models were post-trained using a similar process as used for o4-mini, including a supervised fine-tuning stage and a high-compute RL stage. Our objective was to align the models with the OpenAI Model Spec⁠ and teach it to apply CoT reasoning⁠ and tool use before producing its answer. By using the same techniques as our SoTA proprietary reasoning models, the models demonstrate exceptional capabilities after post-training.

Similar to the OpenAI o-series reasoning models in the API, the two open-weight models support three reasoning efforts—low, medium, and high—which trade off latency vs. performance. Developers can easily set the reasoning effort with one sentence in the system message.

Evaluations

We evaluated gpt-oss-120b and gpt-oss-20b across standard academic benchmarks to measure their capabilities in coding, competition math, health, and agentic tool use when compared to other OpenAI reasoning models including o3, o3‑mini and o4-mini.

gpt-oss-120b outperforms OpenAI o3‑mini and matches or exceeds OpenAI o4-mini on competition coding (Codeforces), general problem solving (MMLU and HLE) and tool calling (TauBench). It furthermore does even better than o4-mini on health-related queries (HealthBench⁠) and competition mathematics (AIME 2024 & 2025). gpt-oss-20b matches or exceeds OpenAI o3‑mini on these same evals, despite its small size, even outperforming it on competition mathematics and health.

gpt-oss models do not replace a medical professional and are not intended for the diagnosis or treatment of disease

Example rollouts

[...]

You're OpenAI's newest open-weight language model gpt-oss-120b!

Some details about you have leaked onto the internet in the last couple…

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Notable open-source GPT release by major lab