Computer-Using Agent
January 23, 2025
Powering Operator with Computer-Using Agent, a universal interface for AI to interact with the digital world.
Today we introduced a research preview of Operator, an agent that can go to the web to perform tasks for you. Powering Operator is Computer-Using Agent (CUA), a model that combines GPT‑4o's vision capabilities with advanced reasoning through reinforcement learning. CUA is trained to interact with graphical user interfaces (GUIs), meaning the buttons, menus, and text fields people see on a screen, just as humans do. This gives it the flexibility to perform digital tasks without using OS- or web-specific APIs.
CUA builds on years of foundational research at the intersection of multimodal understanding and reasoning. By combining advanced GUI perception with structured problem-solving, it can break tasks into multi-step plans and adaptively self-correct when challenges arise. This capability marks the next step in AI development, allowing models to use the same tools humans rely on daily and opening the door to a vast range of new applications.
While CUA is still early and has limitations, it sets new state-of-the-art benchmark results, achieving a 38.1% success rate on OSWorld for full computer use tasks, and 58.1% on WebArena and 87% on WebVoyager for web-based tasks. These results highlight CUA’s ability to navigate and operate across diverse environments using a single general action space.
We’ve developed CUA with safety as a top priority to address the challenges posed by an agent having access to the digital world, as detailed in our Operator System Card. In line with our iterative deployment strategy, we are releasing CUA through a research preview of Operator at operator.chatgpt.com for Pro Tier users in the U.S. to start. By gathering real-world feedback, we can refine safety measures and continuously improve as we prepare for a future with increasing use of digital agents.
How it works
CUA processes raw pixel data to understand what’s happening on the screen and uses a virtual mouse and keyboard to complete actions. It can navigate multi-step tasks, handle errors, and adapt to unexpected changes. This enables CUA to act in a wide range of digital environments, performing tasks like filling out forms and navigating websites without needing specialized APIs.
Given a user’s instruction, CUA operates through an iterative loop that integrates perception, reasoning, and action:
- Perception: Screenshots from the computer are added to the model’s context, providing a visual snapshot of the computer's current state.
- Reasoning: CUA reasons through the next steps using chain-of-thought, taking into consideration current and past screenshots and actions. This inner monologue improves task performance by enabling the model to evaluate its observations, track intermediate steps, and adapt dynamically.
- Action: It performs the actions—clicking, scrolling, or typing—until it decides that the task is completed or user input is needed. While it handles most steps automatically, CUA seeks user confirmation for sensitive actions, such as entering login details or responding to CAPTCHA forms.
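The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OpenAI's implementation: the action classes, `run_agent`, and the `take_screenshot`, `policy`, and `execute` callables are all hypothetical names, and the model is replaced by a scripted stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

# Hypothetical action space: every name here is illustrative only.
@dataclass
class Click:
    x: int
    y: int

@dataclass
class Scroll:
    dy: int

@dataclass
class Type:
    text: str

@dataclass
class Done:
    """The policy decides that the task is complete."""

@dataclass
class AskUser:
    """The policy hands control to the user (e.g. login details, CAPTCHA)."""
    reason: str

Action = Union[Click, Scroll, Type, Done, AskUser]

@dataclass
class Step:
    screenshot: bytes  # raw pixels observed at this step
    thought: str       # chain-of-thought summary for this step
    action: Action     # what the policy chose to do

Policy = Callable[[str, List[Step], bytes], Tuple[str, Action]]

def run_agent(instruction: str,
              take_screenshot: Callable[[], bytes],
              policy: Policy,
              execute: Callable[[Action], None],
              max_steps: int = 50) -> List[Step]:
    """Iterate perception -> reasoning -> action until Done or AskUser."""
    history: List[Step] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()                              # Perception
        thought, action = policy(instruction, history, screenshot)  # Reasoning
        history.append(Step(screenshot, thought, action))
        if isinstance(action, (Done, AskUser)):
            break                          # task finished or user input needed
        execute(action)                                             # Action
    return history

def demo() -> Tuple[List[Step], List[Action]]:
    """Run the loop with a scripted policy in place of a real model."""
    script = [
        ("open the search box", Click(10, 20)),
        ("enter the query", Type("grammar quiz")),
        ("results are visible", Done()),
    ]
    executed: List[Action] = []

    def policy(instruction: str, history: List[Step], screenshot: bytes):
        return script[len(history)]

    history = run_agent("find a grammar quiz", lambda: b"fake-pixels",
                        policy, executed.append)
    return history, executed
```

In this sketch, `Done` and `AskUser` are modeled as actions so that a single return value from the policy covers both "task completed" and "user input needed"; a real system could signal these differently.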
Evaluations
CUA establishes a new state-of-the-art in both computer use and browser use benchmarks by using the same universal interface of screen, mouse, and keyboard.
| Benchmark type | Benchmark | OpenAI CUA (universal interface) | Previous SOTA (universal interface) | Previous SOTA (web browsing agents) | Human |
| --- | --- | --- | --- | --- | --- |
| Computer use | OSWorld | 38.1% | 22.0% | - | 72.4% |
| Browser use | WebArena | 58.1% | 36.2% | 57.1% | 78.2% |
| Browser use | WebVoyager | 87.0% | 56.0% | 87.0% | - |
Evaluation details are described here
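To make the comparison concrete, the following sketch computes CUA's margin over each previous state of the art and its remaining gap to human performance, in percentage points. The figures are copied from the table above; the dictionary keys and function names are illustrative choices, not part of any published evaluation code.

```python
from typing import Dict, Optional

# Success rates (%) copied from the table above; None marks a "-" cell.
RESULTS: Dict[str, Dict[str, Optional[float]]] = {
    "OSWorld":    {"cua": 38.1, "prev_universal": 22.0, "prev_web_agent": None, "human": 72.4},
    "WebArena":   {"cua": 58.1, "prev_universal": 36.2, "prev_web_agent": 57.1, "human": 78.2},
    "WebVoyager": {"cua": 87.0, "prev_universal": 56.0, "prev_web_agent": 87.0, "human": None},
}

def delta(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Return a - b in percentage points, or None if either value is missing."""
    if a is None or b is None:
        return None
    return round(a - b, 1)

def summarize(results: Dict[str, Dict[str, Optional[float]]]) -> Dict[str, Dict[str, Optional[float]]]:
    """Compute per-benchmark gains over prior results and the gap to humans."""
    summary = {}
    for name, row in results.items():
        summary[name] = {
            "gain_vs_universal": delta(row["cua"], row["prev_universal"]),
            "gain_vs_web_agent": delta(row["cua"], row["prev_web_agent"]),
            "gap_to_human": delta(row["human"], row["cua"]),
        }
    return summary

if __name__ == "__main__":
    for name, row in summarize(RESULTS).items():
        print(name, row)
```

On these numbers, the largest gains are over earlier universal-interface agents, while the margin over specialized web browsing agents is 1.0 point on WebArena and zero on WebVoyager.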
Browser use
WebArena and WebVoyager are designed to evaluate the performance of web browsing agents in completing real-world tasks using browsers. WebArena utilizes self-hosted open-source websites offline to imitate real-world scenarios in e-commerce, online store content management (CMS), social forum platforms, and more. WebVoyager tests the model’s performance on online live websites like Amazon, GitHub, and Google Maps.
In these benchmarks, CUA sets a new standard using the same universal interface that perceives the browser screen as pixels and takes action through mouse and keyboard. CUA achieved a 58.1% success rate on WebArena and an 87% success rate on WebVoyager for web-based tasks. CUA achieves a high success rate on WebVoyager, where most tasks are relatively simple, but it still needs improvement to close the gap with human performance on more complex benchmarks like WebArena.
Example trajectory (a new screenshot is captured after the computer is initialized and after every action):
- User prompt
- Initializing computer
- Accessing Cambridge Dictionary Plus section: Click
- Closing advertisement pop-up for access: Click, Wait
- Searching for grammar quizzes available: Scroll
- Clicking button to access grammar quizzes: Click
- Scrolling for recommended grammar quiz: Scroll, Scroll
- Scrolling up to find quizzes: Scroll, Scroll
- Exploring menu for recommended quizzes: Click
- Selecting grammar option for quiz: Click
- Scrolling for grammar quizzes options: Scroll
- Closing pop-up about merchandise: Click
- Closing pop-up, searching for quizzes: Scroll
- Navigating back to quizzes page: Click
- Closing navigation menu, seeking quiz: Click
- Selecting "Adjectives" quiz topic: Click
- Selecting first quiz on adjectives: Click
- Filling in quiz answers for adjectives: Click, Click, Click, Typing
- Entering "tiring" for second question: Click, Typing
- Typing "terrified" for completion: Click, Typing
- Entering answer for fourth question: Click, Typing
- Scrolling down for more questions: Scroll
- Typing "amazing" for question five: Click, Click, Click, Click, Click, Typing
- Filling in sixth…