OpenAI, published Dec 11, 2025

Introducing GPT-5.2


December 11, 2025

The most advanced frontier model for professional work and long-running agents.

We are introducing GPT‑5.2, the most capable model series yet for professional knowledge work.

Already, the average ChatGPT Enterprise user says AI saves them 40–60 minutes a day, and heavy users say it saves them more than 10 hours a week. We designed GPT‑5.2 to unlock even more economic value for people; it’s better at creating spreadsheets, building presentations, writing code, perceiving images, understanding long contexts, using tools, and handling complex, multi-step projects.

GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.

| Benchmark | GPT-5.2 Thinking | GPT-5.1 Thinking |
| --- | --- | --- |
| GDPval (wins or ties), knowledge work tasks | 70.9% | 38.8% (GPT-5) |
| SWE-Bench Pro (public), software engineering | 55.6% | 50.8% |
| SWE-bench Verified, software engineering | 80.0% | 76.3% |
| GPQA Diamond (no tools), science questions | 92.4% | 88.1% |
| CharXiv Reasoning (w/ Python), scientific figure questions | 88.7% | 80.3% |
| AIME 2025 (no tools), competition math | 100.0% | 94.0% |
| FrontierMath (Tier 1–3), advanced mathematics | 40.3% | 31.0% |
| FrontierMath (Tier 4), advanced mathematics | 14.6% | 12.5% |
| ARC-AGI-1 (Verified), abstract reasoning | 86.2% | 72.8% |
| ARC-AGI-2 (Verified), abstract reasoning | 52.9% | 17.6% |

Notion, Box, Shopify, Harvey, and Zoom observed that GPT‑5.2 demonstrates state-of-the-art long-horizon reasoning and tool-calling performance. Databricks, Hex, and Triple Whale found GPT‑5.2 to be exceptional at agentic data science and document analysis tasks. Cognition, Warp, Charlie Labs, JetBrains, and Augment Code say GPT‑5.2 delivers state-of-the-art agentic coding performance, with measurable improvements in areas such as interactive coding, code reviews, and bug finding.

In ChatGPT, GPT‑5.2 Instant, Thinking, and Pro will begin rolling out today, starting with paid plans. In the API, they are available now to all developers.

Overall, GPT‑5.2 brings significant improvements in general intelligence, long-context understanding, agentic tool-calling, and vision—making it better at executing complex, real-world tasks end-to-end than any previous model.

Model performance

Economically valuable tasks

GPT‑5.2 Thinking is the best model yet for real-world, professional use. On GDPval, an eval measuring well-specified knowledge work tasks across 44 occupations, GPT‑5.2 Thinking sets a new state-of-the-art score, and is our first model that performs at or above a human expert level. Specifically, GPT‑5.2 Thinking beats or ties top industry professionals on 70.9% of comparisons on GDPval knowledge work tasks, according to expert human judges. These tasks include making presentations, spreadsheets, and other artifacts. GPT‑5.2 Thinking produced outputs for GDPval tasks at >11x the speed and…

"GPT-5.2 represents the biggest leap for GPT models in agentic coding since GPT-5 and is a SOTA coding model in its price range. The version bump undersells the jump in intelligence. We’re excited to make it the default across Windsurf and several core Devin workloads."

Jeff Wang, CEO, Windsurf

Factuality

GPT‑5.2 Thinking hallucinates less than GPT‑5.1 Thinking. On a set of de-identified queries from ChatGPT, responses with errors were 30% less common in relative terms. For professionals, this means fewer mistakes when using the model for research, writing, analysis, and decision support, which makes the model more dependable for everyday knowledge work.

Reasoning effort was set to the maximum available and a search tool was enabled. Errors were detected by other models, which may make errors themselves. Claim-level error rates are far lower than response-level error rates, as most responses contain many claims.
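The two statistical points above (a relative reduction in error rate, and claim-level rates being lower than response-level rates) can be made concrete with a little arithmetic. The sketch below is illustrative only: the rates and claim counts are hypothetical numbers chosen for the example, not figures from the evaluation, and the independence assumption between claims is a simplification introduced here.

```python
def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate is lower than baseline_rate."""
    return (baseline_rate - new_rate) / baseline_rate


def response_error_rate(claim_error_rate: float, claims_per_response: int) -> float:
    """Chance that a response contains at least one wrong claim.

    Assumes (hypothetically) that claims are wrong independently of each other.
    """
    return 1.0 - (1.0 - claim_error_rate) ** claims_per_response


# Hypothetical response-level error rates, NOT taken from the post.
baseline = 0.10   # 10% of responses contain an error
improved = 0.07   # 7% of responses contain an error
print(f"relative reduction: {relative_reduction(baseline, improved):.0%}")

# Hypothetical: a 1% claim-level error rate over 20 claims per response
# gives a much higher response-level rate than 1%.
print(f"response-level rate: {response_error_rate(0.01, 20):.1%}")
```

Under these assumed numbers, a drop from 10% to 7% is a 30% relative reduction but only a 3 percentage point absolute one, and a small per-claim error rate compounds into a noticeably larger per-response rate.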

Like all models, GPT‑5.2 Thinking is imperfect. For anything critical, double check its answers.

Long context

GPT‑5.2 Thinking sets a new state of the art in long-context reasoning, achieving leading performance on OpenAI MRCRv2, an evaluation that tests a model’s ability to integrate information spread across long documents. On real-world tasks like deep document analysis, which require relating information spread across hundreds of thousands of tokens, GPT‑5.2 Thinking is substantially more accurate than GPT‑5.1 Thinking. In particular, it’s the first model we’ve seen that achieves near 100% accuracy on the 4-needle MRCR variant (out to 256k tokens).

In practical terms, this enables professionals to use GPT‑5.2 to work with long documents—such as reports, contracts, research papers, transcripts, and multi-file projects—while maintaining coherence and accuracy across hundreds of thousands of tokens. This makes GPT‑5.2 especially well suited for deep analysis, synthesis, and complex multi-source workflows.

In OpenAI-MRCR v2 (multi-round co-reference resolution), multiple identical “needle” user requests are inserted into long “haystacks” of similar requests and responses, and the model is asked to reproduce the response to the nth needle. Version 2 of the eval fixes ~5% of tasks that had incorrect ground truth values. Mean match ratio measures the average string match ratio between the model’s response and the correct answer. The points at 256k max input tokens represent averages over 128k–256k input tokens, and so forth. Here, 256k represents 256 * 1,024 = 262,144 tokens. Reasoning effort was set to the maximum available.
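A minimal sketch of the scoring described above may help. The post does not say which string-matching algorithm the eval uses, so `difflib.SequenceMatcher` from the Python standard library stands in here as an assumption. The helper names, the `smallest_k` starting bin, and the choice to treat each bin's upper bound as inclusive are likewise hypothetical.

```python
from difflib import SequenceMatcher

TOKENS_PER_K = 1024  # per the text, 256k means 256 * 1,024 tokens


def match_ratio(response: str, answer: str) -> float:
    """String match ratio in [0, 1]; SequenceMatcher is an assumed stand-in."""
    return SequenceMatcher(None, response, answer).ratio()


def mean_match_ratio(pairs: list[tuple[str, str]]) -> float:
    """Average match ratio over (model response, correct answer) pairs."""
    ratios = [match_ratio(response, answer) for response, answer in pairs]
    return sum(ratios) / len(ratios)


def bin_upper_bound_k(input_tokens: int, smallest_k: int = 8) -> int:
    """Smallest doubling bin (in units of 1,024 tokens) that holds the input.

    For example, inputs above 128k and up to 256k land in the 256k bin.
    smallest_k is a hypothetical starting bin, not taken from the post.
    """
    k = smallest_k
    while k * TOKENS_PER_K < input_tokens:
        k *= 2
    return k


print(256 * TOKENS_PER_K)          # 262144
print(bin_upper_bound_k(200_000))  # 256

# Toy responses: one exact match and one near miss.
print(mean_match_ratio([("abc", "abc"), ("abcd", "abcf")]))  # 0.875
```

Each plotted point would then be the mean match ratio over all tasks whose input length falls in that bin.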

For tasks that benefit from thinking beyond the maximum context window, GPT‑5.2 Thinking is compatible with our new Responses/compact endpoint, which extends the model’s effective context window. This lets GPT‑5.2 Thinking tackle more tool-heavy, long-running workflows that would otherwise be limited by context length. Read more in our API documentation⁠.

Vision

GPT‑5.2 Thinking is our strongest vision model yet, cutting error rates roughly in half on chart reasoning and software interface understanding.

For everyday professional use, this means the model can more accurately interpret dashboards, product screenshots, technical diagrams, and visual reports—supporting workflows in finance, operations, engineering, design, and customer support where visual information is central.

In CharXiv Reasoning⁠, models answer questions about visual charts from scientific papers. A Python tool was enabled and reasoning effort…
