Speeding up agentic workflows with WebSockets in the Responses API
Captured source
source ↗Speeding up agentic workflows with WebSockets in the Responses API | OpenAI
April 22, 2026
Speeding up agentic workflows with WebSockets in the Responses API
By Brian Yu and Ashwin Nathan, Members of the Technical Staff
Loading…
Share
When you ask Codex to fix a bug, it scans through your codebase for relevant files, reads them to build context, makes edits, and runs tests to verify the fix worked. Under the hood, that means dozens of back-and-forth Responses API requests: determine the model’s next action, run a tool on your computer, send the tool output back to the API, and repeat.
All of these requests can add up to minutes that users spend waiting for Codex to complete complex tasks. From a latency perspective, the Codex agent loop spends most of its time in three main stages: working in the API services (to validate and process requests), model inference, and client-side time (running tools and building model context). Inference is the stage where the model runs on GPUs to generate new tokens. In the past, running LLM inference on GPUs was the slowest part of the agentic loop, so API service overhead was easy to hide. As inference gets faster, the cumulative API overhead from an agentic rollout is much more notable.
In this post, we'll explain how we made agent loops using the API 40% faster end-to-end, letting users experience the jump in inference speed from 65 to nearly 1,000 tokens per second. We approached this through caching, eliminating unnecessary network hops, improving our safety stack to quickly flag issues, and—most importantly—building a way to create a persistent connection to the Responses API, instead of having to make a series of synchronous API calls.
When the API became the bottleneck
In the Responses API, previous flagship models like GPT‑5 and GPT‑5.2 ran at roughly 65 tokens per second (TPS). For the launch of GPT‑5.3‑Codex‑Spark, a fast coding model, our goal was an order of magnitude faster: over 1,000 TPS, enabled by specialized Cerebras hardware optimized for LLM inference. To make sure users could experience the true speed of this new model, we had to reduce API overhead.
Around November of 2025, we launched a performance sprint on the Responses API, landing many optimizations to the critical-path latency for a single request:
- Caching rendered tokens and model configuration in memory to skip expensive tokenization and network calls for multi-turn responses
- Reducing network hop latency by eliminating calls to intermediate services (for example, image processing resolution) and directly calling the inference service itself
- Improving our safety stack so we could run certain classifiers to flag conversations faster
With these improvements, we saw close to a 45% improvement in time to first token (TTFT)—which reflects how responsive the API feels—but these improvements were still not fast enough for GPT‑5.3‑Codex‑Spark. Even with these improvements, Responses API overhead was too large relative to the speed of the model—that is, users had to wait for the CPUs running our API before they could use the GPUs serving the model.
The deeper issue was structural: we treated each Codex request as independent, processing conversation state and other reusable context in every follow-up request. Even when most of the conversation hadn't changed, we still paid for work tied to the full history. As conversations got longer, that repeated processing became more expensive.
Building a persistent connection
To tighten up the design, we rethought the transport protocol: could we keep a persistent connection and cache state, rather than establishing a new connection over HTTP and sending the full conversation history for each follow-up request? The idea was to only send any new information requiring validation and processing and cache reusable state in memory for the lifetime of the connection. This would reduce overhead from redundant work.
We considered a few different approaches, including WebSockets and gRPC bidirectional streaming. We landed on WebSockets because as a simple message transport protocol, users wouldn't have to change their Responses API input and output shapes. It was developer-friendly and fit our existing architecture with little disruption.
The first WebSocket prototype changed what we thought was possible for Responses API latency. An engineer on the Codex team with deep expertise across the API stack pulled together a prototype by running a Codex agent overnight.
In that prototype, agentic rollouts were modeled as a single long-running Response. Usingasyncio features, the Responses API would asynchronously block in the sampling loop after a tool call was sampled, and the Responses API would send aresponse.done event back to the client. After executing the tool call, clients would send back aresponse.append event with the tool result, which unblocked the sampling loop and let the model continue.
An analogy here is treating the local tool call as a hosted tool call. When the model calls web search, the inference loop blocks, calls a web search service, and puts the service response in the model context. In our design, we did the same thing; but instead of calling a remote service, we sent the model's tool call to the client back over the WebSocket. When the client responded, we put the client's tool call response into the context and continued to sample.
This design was extremely effective because it eliminated repeated API work across an agent rollout. We could do preinference work once, pause for tool execution, and do postinference work once at the end.
Unfortunately, this came at the cost of a less familiar and more complicated API shape. We wanted developers to be able to drop in WebSocket support without having to rewrite their API integration around a new interaction mode.
Keeping the API familiar while making the stack incremental
For the version we launched, we switched back to a familiar shape: keep usingresponse.create with the same body, and useprevious_response_id to continue the conversation context from the previous response’s state.
On a WebSocket connection, the server keeps a connection-scoped, in-memory cache of previous response state. When a follow-upresponse.create includesprevious_response_id, we fetch that state from the cache instead of rebuilding the full conversation from scratch.
That cached state includes:
- The previous
responseobject - Prior input and…
Excerpt shown — open the source for the full document.
Notability
notability 4.0/10Low traction blog post from OpenAI about API improvement