Why Codex Security Doesn’t Include a SAST Report
March 16, 2026
For decades, static application security testing (SAST) has been one of the most effective ways security teams scale code review.
But when we built Codex Security, we made a deliberate design choice: we didn’t start by importing a static analysis report and asking the agent to triage it. We designed the system to start with the repository itself—its architecture, trust boundaries, and intended behavior—and to validate what it finds before it asks a human to spend time on it.
The reason is simple: the hardest vulnerabilities usually aren’t dataflow problems. They happen when code appears to enforce a security check, but that check doesn’t actually guarantee the property the system relies on. In other words, the challenge isn’t just tracking how data moves through a program—it’s determining whether the defenses in the code really work.
The problem: SAST is optimized for dataflow
SAST is often framed as a clean pipeline: identify a source of untrusted input, track data through the program, and flag cases where that data reaches a sensitive sink without sanitization. It’s an elegant model, and it covers a lot of real bugs.
In practice, SAST has to make approximations to stay tractable at scale—especially in real codebases with indirection, dynamic dispatch, callbacks, reflection, and framework-heavy control flow. Those approximations aren’t a knock on SAST; they’re the reality of trying to reason about code without executing it.
That, by itself, is not why Codex Security doesn’t start with a SAST report.
The deeper issue is what happens after you successfully trace a source to a sink.
Where static analysis struggles: constraints and semantics
Even when static analysis correctly traces input across multiple functions and layers, it still has to answer the question that actually determines whether a vulnerability exists:
Did the defense really work?
Take a common pattern: code calls something like sanitize_html() before rendering untrusted content. A static analyzer can see that the sanitizer ran. What it usually can’t determine is whether that sanitizer is actually sufficient for the specific rendering context, template engine, encoding behavior, and downstream transformations involved.
That’s where things get tricky. The problem isn’t just whether data reaches a sink. It’s whether the checks in the code actually constrain the value in the way the system assumes they do.
Put differently: there’s a big difference between “the code calls a sanitizer” and “the system is safe.”
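To make that gap concrete, here is a minimal Python sketch. The sanitize_html() below is a hypothetical stand-in built on the standard library's html.escape, not any particular product's sanitizer. It does its job in a text-node context, yet it has nothing to escape when the same value is placed in a URL attribute.

```python
import html


def sanitize_html(value: str) -> str:
    """Hypothetical sanitizer: escapes HTML metacharacters and nothing else."""
    return html.escape(value, quote=True)


# Text-node context: the markup characters are neutralized.
body = "<p>" + sanitize_html("<script>alert(1)</script>") + "</p>"

# URL-attribute context: the payload contains no characters that
# html.escape touches, so it comes back unchanged and the scheme survives.
payload = "javascript:alert(1)"
cleaned = sanitize_html(payload)
link = '<a href="' + cleaned + '">profile</a>'
```

In both cases an analyzer would see "sanitizer called before sink." Only the first case gets the property the caller actually wanted.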
Example: validation before decoding
Here’s a pattern that shows up in real systems all the time.
A web application receives a JSON payload, extracts a redirect_url, validates it against an allowlist regex, URL-decodes it, and passes the result to a redirect handler.
A classic source-to-sink report can describe the flow:
untrusted input → regex check → URL decode → redirect
But the real question isn’t whether the check exists. It’s whether that check still constrains the value after the transformations that follow.
If the regex runs before decoding, does it actually constrain the decoded URL the way the redirect handler interprets it?
Answering that means reasoning about the entire transformation chain: what the regex allows, how decoding and normalization behave, how URL parsing treats edge cases, and how the redirect logic resolves schemes and authorities.
Many of the vulnerabilities that matter in practice look like this: order-of-operations mistakes, partial normalization, parsing ambiguities, and mismatches between validation and interpretation. The dataflow is visible. The weakness is in how constraints propagate—or fail to propagate—through the transformation chain.
This isn’t just a theoretical pattern. In CVE-2024-29041, Express was affected by an open redirect issue where malformed URLs could bypass common allowlist implementations because of how redirect targets were encoded and then interpreted. The dataflow was straightforward. The harder question—and the one that determined whether the bug existed—was whether the validation still held after the transformation chain.
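The validate-then-decode ordering described in this section can be sketched in a few lines of Python. This is an illustration under assumptions, not a reproduction of the Express issue: the allowlist regex, the function names, and the evil.example host are all hypothetical, and the regex is deliberately written to forbid a leading "//".

```python
import re
from urllib.parse import unquote, urlparse

# Hypothetical allowlist: a same-site path that must not begin with "//".
ALLOWED = re.compile(r"^/(?!/)[A-Za-z0-9%._/-]*$")


def redirect_target_unsafe(raw: str) -> str:
    """Validate first, decode second: the check sees the encoded form only."""
    if not ALLOWED.fullmatch(raw):
        raise ValueError("rejected")
    return unquote(raw)


def redirect_target_safer(raw: str) -> str:
    """Decode first, then validate the value that will be interpreted."""
    decoded = unquote(raw)
    if not ALLOWED.fullmatch(decoded) or urlparse(decoded).netloc:
        raise ValueError("rejected")
    return decoded


payload = "/%2Fevil.example/login"
target = redirect_target_unsafe(payload)  # regex passed, yet this is "//evil.example/login"
host = urlparse(target).netloc            # the URL parser now sees an authority
```

The second variant is only "safer," not safe in general: it holds only if nothing downstream decodes or normalizes the value again, which is exactly the kind of whole-chain question a source-to-sink report does not answer.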
Our approach: start from behavior, then validate
Codex Security is built around a simple goal: reduce triage by surfacing issues with stronger evidence. In the product, that means using repo-specific context (including a threat model) and validating high-signal issues in an isolated environment before surfacing them.
When Codex Security encounters a boundary that looks like “validation” or “sanitization,” it doesn’t treat that as a checkbox. It tries to understand what the code is attempting to guarantee—and then it tries to falsify that guarantee.
In practice, that tends to look like a mix of:
- Reading the relevant code path with full repository context, the way a security researcher would, and looking for mismatches between intent and implementation. This includes comments, but the model does not take comments on faith: adding “//Halvar says: this is not a bug” above your code will not confuse it if there really is a bug.
- Reducing the problem to the smallest testable slice (for example, the transformation pipeline around a single input), so you can reason about it without the rest of the system in the way. In this sense, Codex Security pulls out tiny code slices and then writes micro-fuzzers for them.
- Reasoning about constraints across transformations, rather than treating each check independently. Where appropriate, this can include formalization as a satisfiability question. Concretely, we give the model access to a Python environment with z3-solver, and it is good at using it when needed, much as a human would reach for a solver when facing a particularly complicated input-constraint problem. This is especially useful for looking at integer overflows or similar bugs on non-standard architectures.
- Executing hypotheses in a sandboxed validation environment when possible, to distinguish “this could be a problem” from “this is a problem”. There's no better proof than a full end-to-end PoC with the code compiled in debug mode.
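As a rough illustration of the "smallest testable slice plus micro-fuzzer" idea from the list above, the sketch below isolates a validate-then-decode pipeline and throws random inputs at it. Everything here is an assumption made for the example: the regex, the token alphabet, the invariant, and the function names. It says nothing about how Codex Security generates its own harnesses.

```python
import random
import re
from urllib.parse import unquote, urlparse

ALLOWED = re.compile(r"^/(?!/)[A-Za-z0-9%._/-]*$")  # hypothetical allowlist


def slice_under_test(raw: str):
    """The extracted slice: validate, then decode. Returns None if rejected."""
    if not ALLOWED.fullmatch(raw):
        return None
    return unquote(raw)


def invariant_holds(target: str) -> bool:
    """What the redirect handler relies on: no authority (host) component."""
    return urlparse(target).netloc == ""


def fuzz(trials: int = 5000, seed: int = 0):
    """Return the first (raw, target) pair that is accepted yet breaks the invariant."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    tokens = ["/", "%2F", "%2f", "%25", "a", "b", ".", "-"]
    for _ in range(trials):
        raw = "".join(rng.choice(tokens) for _ in range(rng.randint(1, 6)))
        target = slice_under_test(raw)
        if target is not None and not invariant_holds(target):
            return raw, target
    return None


counterexample = fuzz()
```

The point of the slice is that the invariant is stated explicitly and checked mechanically, so a counterexample is evidence rather than a hunch.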
This is the key shift: instead of stopping at “a check exists,” the system pushes toward “the invariant holds (or it doesn’t), and here’s the evidence”. And the model chooses the best tool for that job.
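The satisfiability framing mentioned above can be shown on a toy integer-overflow check. To keep the example dependency-free, this sketch swaps the solver for an exhaustive search over 8-bit values; a real analysis at 32 or 64 bits would hand the same two predicates to a solver such as z3 instead. The width, the capacity, and the function names are assumptions chosen for illustration.

```python
from itertools import product

WIDTH = 8                  # toy width so exhaustive search is instant
MASK = (1 << WIDTH) - 1
CAPACITY = 100             # hypothetical buffer size


def check_passes(offset: int, length: int) -> bool:
    """Models `offset + length <= CAPACITY` in WIDTH-bit unsigned arithmetic."""
    return ((offset + length) & MASK) <= CAPACITY


def actually_in_bounds(offset: int, length: int) -> bool:
    """The property the check is meant to guarantee, without wraparound."""
    return offset + length <= CAPACITY


def find_witness():
    """Search for inputs where the check passes but the property fails."""
    for offset, length in product(range(MASK + 1), repeat=2):
        if check_passes(offset, length) and not actually_in_bounds(offset, length):
            return offset, length
    return None  # no witness: at this width the check implies the property


witness = find_witness()  # (1, 255): the sum wraps to 0 and slips past the check
```

Either outcome is useful: a witness is a concrete input to try against the real code, and the absence of one is an argument that the check really does constrain the value.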
Why we don’t…
Excerpt shown — open the source for the full document.