Meta’s Infrastructure Evolution and the Advent of AI
Captured source
source ↗Meta’s Infrastructure Evolution and the Advent of AI - Engineering at Meta
Skip to content
By Yee Jiun Song , Kaushik Veeraraghavan
https://engineering.fb.com/wp-content/uploads/2025/09/Meta-data-centers-AI-Infra-video_small.mp4
Over the past 21 years, Meta has grown exponentially from a small social network connecting a few thousand people in a handful of universities in the U.S. into several apps and novel hardware products that serve over 3.4 billion people throughout the world.
Our infrastructure has evolved significantly over the years, growing from a handful of software systems on a small fleet of servers in a few co-location facilities to a massive, globally networked operation. We faced numerous challenges along the way and developed innovative solutions to overcome them.
The advent of AI has changed all of our assumptions on how to scale our infrastructure. Building infrastructure for AI requires innovation at every layer of the stack, from hardware and software, to our networks, to our data centers themselves.
Facebook was built on the open source Linux, Apache, MySQL, and PhP (LAMP) stack. True to our roots, much of our work has been openly shared with the engineering community in the form of research papers or open source hardware and software systems. We remain committed to this open source vision and describe how we are committed to an open standards approach to silicon and hardware systems as we push the frontiers of computer science.
Scaling Our Infrastructure Stack (2004 – 2010)
In our earliest years, we focused our engineering work on scaling our software stack. As Facebook expanded from Harvard to other universities, each university got its own database. Students logging on to Facebook would connect to a set of common web servers that would in turn connect each student to their university’s database. We quickly realized that students wished to connect with their friends who attended other universities — this was the birth of our social graph that interconnected everyone on the social network.
As Facebook expanded beyond universities to high schools and then the general public, there was a dramatic increase in the number of people on our platform. We managed database load by scaling our Memcache deployments and then building entirely new software systems such as the TAO social graph , and a whole host of new caching and data management systems. We also developed a new ranking service for News Feed and a photo service for sharing photos and videos.
Soon, we were expanding beyond the US to Europe. Scaling our software systems was critical, but no longer sufficient. We needed to find other ways to scale. So we moved one layer below software and started scaling our physical infrastructure. We expanded beyond small co-location facilities in the Bay Area to a co-lo in Ashburn, Va. In parallel, we built out our first data centers in Prineville, Ore . and Forest City, N.C.
As our physical infrastructure scaled to multiple data centers, we ran into two new problems. First, we needed to connect our user base distributed across the US and Europe to our data centers. We tackled this problem by aggressively building out our edge infrastructure where we obtained some compute capacity beside every local internet service provider (ISP) and bought into the peering network that connected the ISP to our data centers. Second, we needed to replicate our entire software stack to each data center so that people would have the same experience irrespective of which actual physical location they connected to. This required us to build a high bandwidth, multipath backbone network that interconnected our data centers. Initially, this entailed building out our terrestrial fiber network to connect the various co-location facilities in California and Virginia to our new data centers in Oregon and North Carolina.
As our user base grew globally, we scaled beyond single data center buildings and into data center regions consisting of multiple buildings. We also aggressively built out our edge presence, where we now operate hundreds of points-of-presence (POPs) across the world.
The Challenges of Scaling (2010 – 2020)
Building out a global infrastructure also brought along all of the complex corner cases of computer science.
Cache Consistency
First, we needed to solve for cache consistency. We saw issues where people would receive notifications about being tagged in a photo, but couldn’t see the photo. Or people in a chat thread would receive messages out-of-order. These problems manifested because we were serving a fraction of our user base out of each data center region. People served out of the same region would receive notifications and see the right data, while people in a different region would experience a lag as the data update was replicated across our distributed fleet. This lag directly led to an inconsistent user experience. We solved these problems by building novel software systems that delivered cache invalidations, eventually building a consistency API for distributed systems .
Fleet management
As we added new data center regions and grew our machine fleet, we also had to develop new abstractions to manage them. This included systems and associated components like:
Twine : a cluster management system that scales to manage millions of machines in a data center region.
Tectonic : a data center scale distributed file system.
ZippyDB : a strongly consistent distributed key value store.
Shard Manager : a global system to manage tens of millions of shards of data, hosted on hundreds of thousands of servers for hundreds of applications.
Delos : a new control plane for our global infrastructure.
Service Router : to manage our global service mesh.
We developed the above systems, and many others, so we could operate a global fleet of millions of machines, while also providing excellent performance.
Masking hardware failure
More machines also implies a higher likelihood of failure. To address this, we worked to ensure that we could mask failures from users and provide a highly available and accessible service. We accomplished this by building new systems like:
Kraken : which leverages live traffic load tests to identify and resolve resource utilization bottlenecks.
Taiji : to manage user traffic load balancing.
Maelstrom : which handled data center-scale disasters safely and efficiently while minimizing user impact.
We continue to invest heavily…
Excerpt shown — open the source for the full document.
Notability
notability 2.0/10Routine blog post, low traction
Meta AI (Llama) has a writing signal matching infrastructure, product and customer.