Writing · Anthropic · published May 22, 2026

Exploit Evals

Captured source

Measuring LLMs’ ability to develop exploits

Anthropic Frontier Red Team, May 22, 2026

Newton Cheng, Keane Lucas, Winnie Xiao, Nicholas Carlini, and Milad Nasr

Introduction

Claude Mythos Preview’s ability to develop exploits is a step-change over previous frontier models. This was one of our primary motivations for rolling out the model carefully through Project Glasswing rather than through a general release. Mythos Preview is capable of finding complex vulnerabilities, but what concerned us most in our internal testing was that Mythos Preview could both turn vulnerabilities into exploit primitives, and combine those primitives together into complete end-to-end attack chains.

When we published our Mythos Preview results, we measured its capabilities by having it search for novel zero-days and then build exploits for them. Qualitative evaluations like this are helpful for showcasing a model’s capabilities, but ideally we would have high-quality quantitative benchmarks that let us measure them precisely. The problem we faced at the time we released Mythos Preview was that no existing public exploit benchmarks were difficult enough to capture Mythos Preview’s capabilities in our initial testing.

Over the last month, however, we have seen the development of two new, more challenging academic benchmarks: ExploitBench and ExploitGym. We collaborated with the researchers who produced these benchmarks to measure Mythos Preview’s performance, and also ran Mythos Preview on an updated version of SCONE-bench, a benchmark we developed in collaboration with MATS and the Anthropic Fellows Program to measure smart contract exploitation.

On all three benchmarks, we’ve found that Mythos Preview consistently outperforms all other evaluated models. We believe this is further evidence that the knowledge and expertise required to develop exploits will drop significantly as Mythos-level capabilities become more widely available.

ExploitBench: V8 bugs

ExploitBench is a benchmark to study the exploit development capabilities of large language models. It’s built by Seunghyun Lee and Prof. David Brumley from Carnegie Mellon University and Bugcrowd. What makes this benchmark interesting is that it focuses on measuring the ability of language models to write complete end-to-end exploits. Prior benchmarks typically focused on measuring the ability of language models to write a “proof-of-concept” that shows the existence of a vulnerability. But a proof-of-concept only indicates that a bug is reproducible or reachable, not that an attacker could use it to actually cause harm. In ExploitBench, language models must build exploit primitives out of the vulnerability in order to enable new capabilities, such as granting the attacker arbitrary code execution (ACE).

ExploitBench decomposes the exploit development process into 16 distinct capabilities. Each of these is verified programmatically, which allows fine-grained analysis of the different intermediate capabilities required to build working exploits. The 16 capabilities are divided into five capability tiers, forming a capability ladder:

T5 Coverage (reaching the vulnerable code path);
T4 Reproduction (constructing a proof-of-concept to trigger the bug);
T3 Target primitives (creating primitives confined to the V8 sandbox);
T2 Generic primitives (breaking the sandbox to get read/write or infoleaks across the process);
T1 Full Control (hijacking control flow or getting arbitrary code execution).
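As a reading aid, the ladder above can be modelled as an ordered enumeration. This is only a minimal sketch, not ExploitBench's own code: the member names are hypothetical, the 16 individual capabilities are not modelled because this excerpt does not list them, and the rule that a run's result is the strongest tier verified in it is an assumption about how "highest tier" is read.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Tier(IntEnum):
    """Capability tiers from the ladder; a lower number is a stronger result.

    Member names are hypothetical labels, not identifiers from ExploitBench.
    """
    T1_FULL_CONTROL = 1
    T2_GENERIC_PRIMITIVES = 2
    T3_TARGET_PRIMITIVES = 3
    T4_REPRODUCTION = 4
    T5_COVERAGE = 5


def highest_tier(verified: Iterable[Tier]) -> Optional[Tier]:
    """Return the strongest tier verified in one run, or None if none was.

    Assumption: a run is credited with the strongest tier it reached.
    """
    tiers = list(verified)
    return min(tiers) if tiers else None
```

For example, a run that reaches the vulnerable code and builds in-sandbox primitives, but never escapes the sandbox, would be credited with T3 under this reading.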

Using this framework, the authors build a V8 benchmark, which uses a set of 41 (now patched) vulnerabilities in the V8 JavaScript and WebAssembly engine that are sourced from the V8 Exploit Tracker. The V8 engine is widely used infrastructure, powering Chromium-derived applications (e.g., Chrome, Edge, Android WebView), Node.js environments (server backends), and Electron apps (e.g., VS Code, Slack, Discord). A key element of this framework is testing against security defenses: the V8 sandbox walls off the memory where a webpage’s JavaScript objects live, so that a V8 bug doesn’t become a foothold deeper into the browser. The highest scoring tier means arbitrary code execution in the entire V8 process (in a browser, this is like taking control over an entire tab).

Given a vulnerable build of the V8 engine and the patch that fixes a given vulnerability, the language model is instructed to build an exploit for that bug. The exploits are then scored automatically against all 16 capabilities, with no human or LLM judge. Lower tiers are checked by differential execution against the patched build; higher tiers use challenge-response functions built into V8 that are replayed across multiple randomized heap layouts, so hardcoding a leaked address won’t pass. A separate static scan of the transcripts flags other forms of cheating as a backstop.

All models run on an identical ExploitBench harness with a 300-turn budget, which itself has two variants: Baseline and Nudged. In the Nudged variant, additional prompts are adaptively injected by the harness to warn the model to wrap up when close to the budget limit, or to encourage the model to use up its turn budget if it stops too early. Each variant is run for three trials. Anthropic ran all Claude models, and then provided all results and transcripts to the benchmark authors, who verified the results.
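The difference between the two variants can be illustrated with a small decision function. This is a sketch under stated assumptions, not the ExploitBench harness: the 300-turn budget comes from the text, while the `nudge` function, the `WRAP_UP_MARGIN` threshold, and the prompt wording are all hypothetical.

```python
from typing import Optional

TURN_BUDGET = 300      # stated in the text
WRAP_UP_MARGIN = 20    # hypothetical: how many remaining turns count as "close"


def nudge(turn: int, model_stopped: bool,
          budget: int = TURN_BUDGET,
          margin: int = WRAP_UP_MARGIN) -> Optional[str]:
    """Return a prompt to inject after `turn` turns, or None for no injection.

    A Baseline-style run would never call this; a Nudged-style run would
    call it after every turn.
    """
    remaining = budget - turn
    if remaining <= 0:
        return None  # budget exhausted, nothing left to nudge
    if model_stopped and remaining > margin:
        # The model gave up early: encourage it to keep using its budget.
        return f"You still have {remaining} turns left; keep working on the task."
    if not model_stopped and remaining <= margin:
        # The model is still working near the limit: warn it to finish.
        return f"Only {remaining} turns remain; wrap up and submit your best result."
    return None
```

Keeping the policy a pure function of the turn count and the model's state makes it straightforward to apply the same rule to every model, in keeping with the identical-harness setup described above.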
Figure 1: The highest tier achieved in 3 trials across 41 CVE environments and the Mean cap(abilities) achieved out of the 16 measured across all trials. Spend is calculated by API usage. Source: exploitbench.ai

Figure 2: Cumulative number of environments out of 41 for which a model was able to reach a given capability tier for the Baseline variant.

Consistent with our previous findings on Mozilla Firefox, all language models can reach or trigger the given vulnerabilities, but only models since Claude Opus 4.6 make any progress in developing primitives inside the V8 sandbox. Escaping the V8 sandbox, going from T3 to T2, is the next capability cliff; Mythos Preview is the only tested model that can reliably do so, which it does in over half the tested environments. It also achieves control flow hijack (T1) in almost half the environments in the Baseline variant. Combining Baseline and Nudged variants, Mythos Preview achieves ACE on...
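The two aggregate views in the figure captions, best tier over three trials and cumulative environments per tier, can be sketched as follows. The function names are hypothetical, and treating "able to reach" as "reached in at least one trial" is an assumption. Tiers are plain integers where 1 is T1 (strongest) and 5 is T5.

```python
from typing import Dict, List, Optional

TIERS = [5, 4, 3, 2, 1]  # T5 (coverage) through T1 (full control)


def best_of_trials(trial_tiers: List[Optional[int]]) -> Optional[int]:
    """Strongest tier across trials; None means no tier was reached at all."""
    reached = [t for t in trial_tiers if t is not None]
    return min(reached) if reached else None


def cumulative_reach(best_per_env: Dict[str, Optional[int]]) -> Dict[int, int]:
    """For each tier, count environments whose best tier is at least that strong."""
    return {
        tier: sum(1 for best in best_per_env.values()
                  if best is not None and best <= tier)
        for tier in TIERS
    }


# Made-up illustrative data, not benchmark results.
trials = {"env-a": [5, 4, 3], "env-b": [None, 5, 5], "env-c": [2, 1, 3]}
best = {env: best_of_trials(tiers) for env, tiers in trials.items()}
counts = cumulative_reach(best)
```

Because reaching a stronger tier implies passing the weaker ones in this model, the counts can only fall or stay flat when moving from T5 toward T1, which is the shape a cumulative plot of this kind would show.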

Excerpt shown — open the source for the full document.

Additional captured pages
Exploit Evals · captured 1w

EXPLOITBENCH: A CAPABILITY LADDER BENCHMARK FOR LLM CYBERSECURITY AGENTS
A PREPRINT
Seunghyun Lee, Carnegie Mellon University, Pittsburgh, PA 15213
David Brumley, Carnegie Mellon University and Bugcrowd, Pittsburgh, PA 15213
May 15, 2026
ABSTRACT Exploitation is not a binary event....

System Card: Claude Mythos Preview
April 7, 2026, anthropic.com
Abstract: This System Card describes Claude Mythos Preview, a large language model from Anthropic. Claude Mythos Preview is our most capable frontier model to date, and shows a striking leap in scores on many...

Notability

notability 7.0/10

Notable Anthropic research post on evals.