What does this writing signal mean?

OpenAI Writing: Understanding prompt injections: a frontier security challenge

Captured source

source ↗

openai.com/openai.com/index/prompt-injections

Understanding prompt injections: a frontier security challenge

Source ↗

published Nov 7, 2025seen Jun 5captured Jun 8http 200method exa

Understanding prompt injections: a frontier security challenge | OpenAI

November 7, 2025

Understanding prompt injections: a frontier security challenge

AI tools are starting to do more than respond to questions. They can now browse the web, help with research, plan trips, and help buy products. As they become more capable, with the ability to access your data in other apps and take actions on your behalf, new security challenges emerge. One we’re heavily focused on is prompt injection.

What is a prompt injection?

Prompt injection is a type of social engineering attack specific to conversational AI. Early AI systems were conversations between a single user and a single AI agent. In AI products today, your conversation may include content from many sources, including the internet. The idea that a third-party (that is not the user and not the AI) could mislead the model by injecting malicious instructions into the conversation context led to the term “prompt injection”.

In the same way that phishing emails or scams on the web attempt to trick people into giving away sensitive information, prompt injections attempt to trick AIs into doing something you did not ask for.

Imagine you have asked an AI to help you do some vacation research online, and while it is doing that it encounters misleading content or harmful instructions hidden on a webpage, such as in a comment on a listing or on a review. The content could be carefully crafted in an attempt to trick an AI into recommending the wrong listing, or worse, to steal your credit card information.

These are just a few examples of “prompt injection” attacks—harmful instructions designed to trick an AI into doing something you didn’t intend, often hidden inside ordinary content such as a web page, document, or email.

These risks increase as AIs have access to more sensitive data and take on more initiative and longer tasks.

Summary

What you asked the AI to do

What the attacker does

Potential result if the attack succeeds

You ask an AI to research apartments, and it is prompt injected to recommend a listing that isn’t the best option for you.

You ask an AI to research apartments with some given criteria.

The attacker has included a prompt injection attack in the apartment listing to trick the AI into thinking that their listing needs to be picked regardless of the user’s stated preferences.

If the attack succeeds, the AI may incorrectly recommend a sub-optimal apartment listing based on your preferences.

You ask an AI agent to respond to your emails from overnight, and it ends up sharing your bank statements.

You ask an AI agent to generally respond to your emails from overnight because you are busy this morning.

See “When possible, give an agent explicit instructions” below

The attacker sent you an email that includes misinformation that tricks the model to find your bank statements and share them with the attacker.

If the attack succeeds, the agent may look for anything like bank statements in your email (which you gave access to for the task) and will share them with the attacker.

Our approach to protecting users

Defending against prompt injection is a challenge across the AI industry and a core focus at OpenAI. While we expect adversaries to continue developing such attacks, we’re building defenses designed to carry out the user’s intended task even when someone is actively trying to mislead them. That capability is essential to safely realizing the benefits of AGI.

To protect our users, and to help improve our models against these attacks, we take a multi-layered approach, including the following:

Safety training

We want AI that recognizes prompt injections and doesn’t fall for them. However, robustness to adversarial attacks is a long-standing challenge for machine learning and AI, making this a hard, open problem. We have developed research called Instruction Hierarchy⁠ to work towards models distinguishing between instructions that are trusted and untrusted. We continue to develop new approaches to train models to better recognize prompt injection patterns so they can ignore them or flag them to users. One of the techniques we apply is automated red-teaming, an area we've been studying⁠ for years, to develop novel prompt injection attacks.

Monitoring

We have developed multiple automated AI-powered monitors⁠ to identify and block prompt injection attacks. These complement the safety training approaches because they can be updated rapidly to quickly block any new attacks we uncover. These monitors not only help identify potential prompt injection attacks against our users, but can also allow us to catch adversarial prompt injection research and testing using our platform, before those attacks are deployed in the wild.

Security protections

We have designed our products and infrastructure with various overlapping security protections to help safeguard user data. These features, which we will explore in more technical detail in future posts, are tailored on a per-product basis. For example, to help you avoid untrusted sites, we will ask you to approve certain links in ChatGPT, especially on websites that ask us not to catalogue them⁠, before they can be visited. When our AI uses tools to run other programs or code (as in Canvas, or our development tool Codex), we use a technique called sandboxing to prevent the model from making harmful changes that might be the result of a prompt injection.

Give users control

We include built-in controls in our products to help users protect themselves. For example, in ChatGPT Atlas, you can select logged-out mode which allows ChatGPT agent to start tasks without being logged-in to sites. ChatGPT agent also pauses and asks for confirmation prior to taking sensitive steps such as completing a purchase. When agent is operating on sensitive sites, we have also implemented a “Watch Mode” that alerts you to the sensitive nature of the site and requires you have the tab active to watch the agent do its work. Agent will pause if you move away from the tab with sensitive information. This ensures you stay aware—and in control—of what actions the agent is performing.

Red-teaming

We perform extensive red-teaming with internal and external teams to test and improve our defenses, emulate attacker behavior, and find new ways to improve our security. This includes thousands of hours focused specifically on prompt injection. As we have...

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable research post from OpenAI on security

OpenAI has a writing signal matching infrastructure, safety and policy.

Infrastructure Safety and policy