WritingCloudflare (Workers AI)Cloudflare (Workers AI)published May 28, 2026seen 5d

How we built Cloudflare's data platform and an AI agent on top of it

Open original ↗

Captured source

source ↗
published May 28, 2026seen 5dcaptured 3dhttp 200method plain

How we built Cloudflare's data platform and an AI agent on top of it How we built Cloudflare's data platform and an AI agent on top of it 2026-05-28 Brian Brunner

Dmitry Alexeenko

Matt Moen

12 min read This post is also available in 日本語 and 한국어 . Cloudflare processes more than a billion events every second. Our network spans 330+ cities in 120+ countries. Behind every HTTP request, every Worker invocation, every R2 read operation, there is data, and a lot of it. For years, that data was not very easy to access. It lived in dozens of production databases, ClickHouse clusters, Kafka streams, Google Cloud buckets, BigQuery datasets, and a long tail of pipelines. To answer a simple question like "How many domains that signed up today are in the Top 100 by traffic?", an analyst at Cloudflare had to know which system to ask, what credentials to use, what query language to write, and whether the data they were looking at was sampled, fresh, or seven-days stale. As a result, it was difficult to glean informed insights from the data. To solve this problem, we built two in-house tools: Town Lake, Cloudflare's unified data analytics platform, and Skipper, an AI data agent that runs on top of it. Town Lake is a single SQL interface to everything Cloudflare knows, and Skipper is how anyone at Cloudflare can ask questions in plain English and get correct, auditable answers back in seconds. This is the story of how we built both.

The shape of the problem

If you have ever worked at a company that went through a hyper-growth period, you know what data sprawl looks like. Ours had a few specific symptoms: Too many disparate systems. A product engineer who wanted to investigate a customer issue might need to query Postgres for account metadata, ClickHouse for analytics events, BigQuery for usage rollups, R2 for raw logs, and Kafka topics for real-time signals. Each system had its own credentials, its own language, and its own retention policy.

Sampled data. This is fine for dashboards, but doesn’t work for domains like billing. Our analytics pipeline downsamples to handle 700M+ events per second. That is the right behavior when you want an analytics dashboard to load, but it’s exactly the wrong behavior when you are trying to compute someone’s usage required to issue an invoice.

External dependencies for internal data. Parts of our previous internal reporting stack were powered by external vendors. Beyond the cost, we had a hard external dependency on another cloud for some of our critical data.

No one could find the data. Even if you had all the right credentials, you needed to know that the right table for "Billable Workers requests by account" lived in a specific ClickHouse cluster, in a specific schema, joined to a specific Postgres dimension table, and that the join required an obscure customer ID translation. There was too much tribal knowledge.

We had a cultural challenge too: data infrastructure had historically been treated as a back-office function that was in service of the business, rather than critical infrastructure in its own right.

What we wanted

We wanted to create one place where anyone at the company with appropriate permissions and a need to know could get answers to questions about Cloudflare: “Show me the top 100 customers by revenue in the last quarter”, “List all Bot Management ML scoring events with score > 0.9 in the last 48 hours coming from a specific ASN”, “Find the Top 100 billing support tickets from customers who have spent >$100”, etc. We wanted that place to give fresh, accurate, unsampled data for the queries that need it (like billing or security investigations) and fast, downsampled data for the queries that don't (like dashboards or exploration). We wanted security and governance baked in, with personally identifiable information (PII) detected automatically, and sensitive tables locked down by default. All access should be auditable, and have time-bounded permission grants so that users could only access data when they were actively working on tasks that required it. We wanted it to be built on Cloudflare’s own platform: R2 for storage, Workers for compute, Cloudflare Access for authentication, Workflows for orchestration. If we were going to make a major investment in our data infrastructure, it was going to be built on the same products we sell to customers. And we wanted, eventually, an interface that did not require knowing any SQL. The goal was to empower anyone at the company with appropriate permissions and a need to know to look at the stream of data flowing through our network, not just analysts. That last requirement is what became Skipper.

Town Lake, the platform

At its core, our data platform’s architecture is a data lakehouse : a query engine that reads from object storage, with a metadata layer that makes the storage behave like a database. We call it Town Lake, after its namesake in Austin, Texas.

Its most important components are: Query engine. We chose Apache Trino for that: a single SQL query can join a Postgres table, a ClickHouse table, and an Iceberg table on R2 without a need to materialize the intermediate results into a different system. A query that asks "what are the top 100 paying customers by Workers requests this week" compiles into a plan that pushes filters into ClickHouse, joins against an account dimension in Postgres, and ranks against billing rollups in R2, all in one go. R2 Data Catalog, our managed Apache Iceberg service, is where the cold and warm data lives. Iceberg gives us schema evolution, time travel, partition evolution, and the ability to compact data as it ages. Per-minute usage from last week becomes hourly, hourly from last quarter becomes daily, etc. The storage cost decreases as recency does, while the data stays queryable. Parquet files in R2 are much cheaper compared to keeping the same data in an OLAP database. DataHub is our metadata catalog. Every table, column, owner, lineage edge, and glossary term lives there. When a user asks "what's in townlake.dim.accounts ," DataHub provides an answer, including the table description, the column descriptions, the owning team, the upstream tables that feed it, and the downstream tables that consume it. Lifeguard is our access control service: it stores access rules in D1, dynamically pulls user and group membership from our internal access management system, and renders a combined JSON policy that Trino reads over HTTP. Lifeguard also feeds basic…

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

Low-traction blog post, not a notable release