Continuously hardening ChatGPT Atlas against prompt injection
Captured source
source ↗Continuously hardening ChatGPT Atlas against prompt injection attacks | OpenAI
December 22, 2025
Continuously hardening ChatGPT Atlas against prompt injection attacks
Automated red teaming—powered by reinforcement learning—helps us proactively discover and patch real-world agent exploits before they’re weaponized in the wild.
Loading…
Share
Agent mode in ChatGPT Atlas is one of the most general-purpose agentic features we’ve released to date. In this mode, the browser agent views webpages and takes actions, clicks, and keystrokes inside your browser, just as you would. This allows ChatGPT to work directly on many of your day-to-day workflows using the same space, context, and data.
As the browser agent helps you get more done, it also becomes a higher-value target of adversarial attacks. This makes AI security especially important. Long before we launched ChatGPT Atlas, we’ve been continuously building and hardening defenses against emerging threats that specifically target this new “agent in the browser” paradigm. Prompt injection is one of the most significant risks we actively defend against to help ensure ChatGPT Atlas can operate securely on your behalf.
As part of this effort, we recently shipped a security update to Atlas’s browser agent, including a newly adversarially trained model and strengthened surrounding safeguards. This update was prompted by a new class of prompt-injection attacks uncovered through our internal automated red teaming.
In this post, we explain how prompt-injection risk can arise for web-based agents, and we share a rapid response loop we’ve been building to continuously discover new attacks and ship mitigations quickly—illustrated by this recent security update.
We view prompt injection as a long-term AI security challenge, and we’ll need to continuously strengthen our defenses against it (much like ever-evolving online scams that target humans). Our latest rapid response cycle is showing early promise as a critical tool on that journey: we’re discovering novel attack strategies internally before they show up in the wild. Our long-term vision is to fully leverage (1) our white-box access to our models, (2) deep understanding of our defenses, and (3) compute scale to stay ahead of external attackers—finding exploits earlier, shipping mitigations faster, and continuously tightening the loop. Combined with frontier research on new techniques to address prompt injection and increased investment in other security controls, this compounding cycle can make attacks increasingly difficult and costly, materially reducing real-world prompt-injection risk. Ultimately, our goal is for you to be able to trust a ChatGPT agent to use your browser the way you’d trust a highly competent, security-aware colleague or friend.
Prompt injection as an open challenge for agent security
A prompt injection attack targets AI agents by embedding malicious instructions into content the agent processes. Those instructions are crafted to override or redirect the agent’s behavior—hijacking it into following an attacker’s intent, rather than the user’s.
For a browser agent like the one inside ChatGPT Atlas, prompt injection adds a new threat vector beyond traditional web security risks (like user error or software vulnerabilities). Instead of phishing humans or exploiting system vulnerabilities of the browser, the attacker targets the agent operating inside it.
As a hypothetical example, an attacker could send a malicious email attempting to trick an agent to ignore the user’s request and instead forward sensitive tax documents to an attacker-controlled email address. If a user asks the agent to review unread emails and summarize key points, the agent may ingest that malicious email during the workflow. If it follows the injected instructions, it can go off-task—and wrongly share sensitive information.
This is just one specific scenario. The same generality that makes browser agents useful also makes the risks broader: the agent may encounter untrusted instructions across an effectively unbounded surface area—emails and attachments, calendar invites, shared documents, forums, social media posts, and arbitrary webpages. Since the agent can take many of the same actions a user can take in a browser, the impact of a successful attack can hypothetically be just as broad: forwarding a sensitive email, sending money, editing or deleting files in the cloud, and more.
We’ve made progress defending against prompt injection through multiple layers of safeguards, as we shared in an earlier post. However, prompt injection remains an open challenge for agent security, and one we expect to continue working on for years to come.
Automated prompt injection attack discovery through end-to-end and high-compute reinforcement learning
To strengthen our defenses, we’ve been continuously searching for novel prompt injection attacks against agent systems in production. Finding these attacks is a necessary prerequisite for building robust mitigations: it helps us understand real-world risk, exposes gaps in our defenses, and drives concrete patches.
To do this at scale, we built an LLM-based automated attacker and trained it to hunt for prompt injection attacks that can successfully attack a browser agent. We trained this attacker end-to-end with reinforcement learning, so it learns from its own successes and failures to improve its red teaming skills. We also let it “try before it ships”, by which we mean: during its chain of thought reasoning, the attacker can propose a candidate injection and send it to an external simulator. The simulator runs a counterfactual rollout of how the targeted victim agent (the defender) would behave if it encountered the injection, and returns a full reasoning and action trace of the victim agent. The attacker uses that trace as feedback, iterates on the attack, and reruns the simulation—repeating this loop multiple times before committing to a final attack. This provides richer in-context feedback to the attacker than a single pass/fail signal. It also scales up the attacker’s test-time compute. Moreover, privileged access to the reasoning traces (that we don’t disclose to external users) of the defender gives our internal attacker an asymmetric advantage—raising the odds that it can outrun external adversaries.
Why reinforcement learning (RL)? We chose reinforcement learning to train the automated attacker for multiple reasons:
1.…
Excerpt shown — open the source for the full document.
Notability
notability 2.0/10Low traction routine security post