WritingOpenAIOpenAIpublished May 7, 2026seen 6d

Parloa builds service agents customers want to talk to

Open original ↗

Captured source

source ↗
published May 7, 2026seen 6dcaptured 3dhttp 200method exa

Parloa builds service agents customers want to talk to | OpenAI

May 7, 2026

Startup

Parloa builds service agents customers want to talk to

Parloa uses OpenAI models to simulate, evaluate, and run voice-driven customer service systems for the enterprise.

Company size: Startup

Region: Europe & UK, Global

Industry: Technology

Products: API

Loading…

Share

In Parloa’s early days, Co-founder Stefan Ostwald spent a day inside an insurance call center, where his team had been building early voice experiences. Sitting alongside agents, he listened to the same conversations play out again and again: password resets, policy questions, routine changes. He realized much of that work could be automated.

After that experience, Berlin-based Parloa⁠ began building rule-based voice agents to automate high-volume customer interactions.

With the emergence of ChatGPT, the company evolved to build what is now its AI Agent Management Platform (AMP), built on a new generation of models including GPT‑5.4.

AMP gives enterprises a way to design, deploy, and manage customer service interactions at scale. Instead of mapping out rigid intents and flows, teams define behavior in natural language, connect to internal systems, and iterate quickly using built-in simulations and evaluations.

Parloa runs these interactions end to end, handling everything from simple routing to complex, multi-step requests. The focus is on consistency in production, where performance, latency, and edge cases all matter. To get there, Parloa continuously tests models against real customer scenarios before deploying them.

> “The models only matter if they work in production. We work closely with OpenAI on how to make the models fast and reliable enough for real-time conversations.”

—Ciaran O’Reilly Ibañez, Engineering Manager at Parloa

Designing AMP for enterprise builders

Parloa’s Agent Management Platform (AMP) is designed for business users and subject matter experts to be able to build AI agents without writing code.

“With AMP, we can have subject matter experts from different business units actually build the agents and connect the APIs in a much leaner and simpler way,” says O’Reilly.

At a high level, AMP allows brands to manage the entire AI agent lifecycle. It does that by giving non-technical teams a simpler way to define how an agent should behave before it ever goes live. Instead of writing code or mapping rigid intent trees, subject matter experts set the agent’s role, instructions, tools, and boundaries in natural language. That configuration becomes the basis for how the model is prompted and how the system behaves in production.

Once defined, the agent is tested before deployment. Parloa simulates customer conversations using models like GPT‑5.4, with one model acting as the caller and another running the configured agent. Teams can inspect these interactions directly, test changes against realistic scenarios, and iterate before going live.

The same models are then used to evaluate those conversations using a mix of deterministic checks and LLM-as-a-judge scoring. This shows whether the agent followed instructions, used tools correctly, and completed the task as expected.

During a live conversation, AMP’s orchestration layer prompts an OpenAI model with the agent configuration and conversation context to generate a response, retrieve information through RAG, or trigger tools to interact with customer backends. Parloa continuously updates this layer with the latest generation of models as they demonstrate clear gains in real world performance.

After the conversation, separate OpenAI-powered workflows summarize the interaction, classify customer intent, and evaluate performance against defined rules.

As agents became more complex, maintaining a single, monolithic prompt became harder. Small changes could introduce unintended side effects. To address this, Parloa introduced a modular approach. Tasks like authentication, booking changes, or account updates can be separated into distinct sub-agents, improving instruction-following and making systems easier to evolve over time.

At the same time, the platform incorporates deterministic controls where reliability matters most. Enterprises can define structured API chains and event-based logic to ensure critical steps happen in the right order, balancing conversational flexibility with predictable execution.

Parloa uses models like GPT‑4.1, GPT‑5‑mini, and others to simulate realistic customer interactions before an agent ever goes live, then evaluates those interactions using a combination of LLM-as-a-judge and deterministic rules. This allows teams to test edge cases, iterate quickly, and validate performance before exposing customers to failure.

An evaluation-first approach

Parloa works primarily with large enterprises, where consistency matters as much as capability.

“When a new model comes out, we run our benchmarking suite against it,” says Matthäus Deutsch, Senior Applied Scientist. “It’s very important for us that things do not only work in theoretical benchmarks but in actual real use cases.”

Instead of relying on abstract benchmarks, Parloa mirrors real production agents and runs them through simulation and evaluation pipelines. These tests measure instruction-following reliability, API-calling consistency, latency, and overall performance under realistic conditions.

These evaluations determine which models are ready for production. Only models that perform reliably across real customer scenarios are deployed.

“Enterprise customers face a real migration cost,” says Deutsch. “Once a system is working in production, they keep it stable and only switch when the benefits are clear.”

As a result, systems behave predictably in production, even at scale. Across millions of customer interactions, most conversations are resolved without friction. Even when calls are routed to human agents, escalation is rarely driven by failure. In one deployment, a global travel company reduced requests for a human agent by 80%.

This evaluation-first mindset has become a core differentiator, allowing Parloa to move quickly without sacrificing reliability in production.

Building for voice at global scale

Voice introduces a different set of constraints than text-based chat. Every interaction runs through a low-latency pipeline: speech-to-text, model reasoning, and text-to-speech.

This pipeline makes latency critical. Even small delays in the…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine customer success story

OpenAI has a writing signal matching data demand, product and customer.