Meta AI (Llama), published Mar 11, 2026

Four MTIA Chips in Two Years: Scaling AI Experiences for Billions


Hardware • March 11, 2026 • 17 minute read

Every day, billions of people on Meta’s platforms enjoy an array of AI-powered experiences ranging from personalized recommendations to AI assistants. Meanwhile, the AI models that will define the next era of computing are evolving faster than any single hardware generation can anticipate. Serving a wide range of AI models on a global scale, while maintaining the lowest possible costs, is one of the most demanding infrastructure challenges in the industry. Our response is to define the path forward: delivering flexible solutions today and improving them continuously as needs evolve. While we remain committed to a diverse silicon portfolio and to leveraging the best solutions available, both internally and externally, the Meta Training and Inference Accelerator (MTIA), our family of homegrown AI chips developed in close partnership with Broadcom, has remained and will continue to be an important part of Meta’s AI infrastructure strategy. MTIA plays an important role in cost-effectively powering AI experiences for the billions of people who use Meta’s products.

The Past and Future of MTIA

We have published research papers at ISCA’23 and ISCA’25 detailing the first two generations of MTIA chips: MTIA 100 and MTIA 200 (formerly known as MTIA 1 and MTIA 2i). More importantly, we have deployed hundreds of thousands of MTIA chips in production, onboarded numerous internal production models, and tested MTIA with large language models (LLMs) like Llama.

Since introducing MTIA 100 and 200, we have accelerated MTIA development across four successive generations: MTIA 300, 400, 450, and 500. These new chips have either already been deployed or are scheduled for deployment in 2026 or 2027, expanding workload coverage from ranking and recommendation (R&R) inference to R&R training, general GenAI workloads, and GenAI inference with targeted optimizations.

AI models are evolving faster than traditional chip development cycles. Chip designs are based on projected workloads, but by the time the hardware reaches production, often two years later, those workloads may have shifted substantially. Rather than placing a bet and waiting for a long period of time, we deliberately take an iterative approach: each MTIA generation builds on the last, using modular chiplets, incorporating the latest AI workload insights and hardware technologies, and deploying on a shorter cadence. This tighter loop keeps our hardware better aligned with evolving models while enabling faster adoption of new technology.

The MTIA family now includes:

MTIA 300: Initially optimized for R&R models (the dominant Meta workload before GenAI took off), its building blocks established a strong foundation for subsequent chips optimized for GenAI models. It is in production for R&R training.

MTIA 400: As GenAI surged, MTIA 300 evolved into MTIA 400 to better support GenAI models, while maintaining the capabilities for supporting R&R workloads. Featuring a 72-accelerator scale-up domain, MTIA 400 delivers high performance that is competitive with leading commercial products. We have finished testing MTIA 400 in our labs and are on the path to deploying it in our data centers.

MTIA 450: Anticipating the rise in GenAI inference demand, MTIA 400 transitioned into MTIA 450, with specific optimizations for GenAI inference. Since the bandwidth of high-bandwidth memory (HBM) is the most important factor affecting GenAI inference performance, we doubled HBM bandwidth from MTIA 400 to 450, making it much higher than that of existing leading commercial products. Additionally, we introduced low-precision data types co-designed for inference workloads. MTIA 450 is scheduled for mass deployment in early 2027.

MTIA 500: Continuing the focus on GenAI inference, MTIA 500 increased HBM bandwidth by an additional 50% compared to MTIA 450 and introduced further innovations in low-precision data types. MTIA 500 is scheduled for mass deployment in 2027.
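The claim that HBM bandwidth dominates GenAI inference performance can be illustrated with a common first-order model of token-by-token decoding: for a dense model at batch size one, each generated token requires streaming the model weights from memory once, so throughput is bounded by bandwidth divided by bytes read per token. The sketch below uses that simplification (it ignores KV-cache traffic, batching, and compute limits). The function name, model size, and bandwidth values are hypothetical placeholders, not MTIA specifications.

```python
def decode_tokens_per_second_bound(params, bits_per_weight, hbm_bytes_per_second):
    """Upper bound on decode throughput when weight reads from HBM are the bottleneck.

    Assumes every weight is read once per generated token (dense model, batch size 1).
    """
    bytes_per_token = params * bits_per_weight / 8
    return hbm_bytes_per_second / bytes_per_token


if __name__ == "__main__":
    # Hypothetical numbers chosen only to show the scaling behaviour.
    params = 70e9          # hypothetical 70B-parameter model
    bandwidth = 1e12       # hypothetical 1 TB/s of HBM bandwidth

    base = decode_tokens_per_second_bound(params, 8, bandwidth)
    doubled_bw = decode_tokens_per_second_bound(params, 8, 2 * bandwidth)
    half_bits = decode_tokens_per_second_bound(params, 4, bandwidth)

    print(f"8-bit weights, 1x bandwidth: {base:.1f} tokens/s")
    print(f"8-bit weights, 2x bandwidth: {doubled_bw:.1f} tokens/s")
    print(f"4-bit weights, 1x bandwidth: {half_bits:.1f} tokens/s")
```

Under this model, doubling HBM bandwidth and halving the bits per weight each double the bound, which is consistent with the article pairing bandwidth increases with lower-precision data types.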

The Evolution of MTIA Chips

From MTIA 300 to MTIA 500, HBM bandwidth increases by 4.5x and compute FLOPS increases by 25x (from MTIA 300’s MX8 to MTIA 500’s MX4), as shown in the chip specifications below. This rapid advancement in less than two years highlights the benefits of our velocity strategy.
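The 4.5x HBM bandwidth figure can be cross-checked against the per-generation steps quoted elsewhere in this article: 51% higher from MTIA 300 to 400, a doubling from 400 to 450, and a further 50% from 450 to 500. A minimal sketch of that arithmetic follows; the names `HBM_STEPS` and `cumulative_gain` are illustrative and not from the source.

```python
from math import prod

# Per-generation HBM bandwidth multipliers, taken from the article's own statements.
HBM_STEPS = {
    "MTIA 300 -> 400": 1.51,  # "51% higher HBM bandwidth"
    "MTIA 400 -> 450": 2.00,  # "doubled HBM bandwidth"
    "MTIA 450 -> 500": 1.50,  # "an additional 50%"
}


def cumulative_gain(steps):
    """Multiply the per-step gains to get the end-to-end gain."""
    return prod(steps.values())


if __name__ == "__main__":
    print(f"MTIA 300 -> 500 HBM bandwidth: {cumulative_gain(HBM_STEPS):.2f}x")
```

The product is roughly 4.53x, which rounds to the 4.5x stated above.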

*Some vendors report bidirectional bandwidth. Multiply the value in the table by two to obtain the corresponding bidirectional bandwidth.
**MTIA 300 is configured with a scale-out network with higher bandwidth (200 GB/s) due to its relatively small scale-up domain size and the target R&R workloads.

MTIA 300: A Cost-Effective Foundation

Compared with earlier generations, MTIA 300’s distinguishing features include built-in NIC chiplets, dedicated message engines for offloading communication collectives, and near-memory compute for reduction-based collectives. Although initially optimized for R&R training, these low-latency, high-bandwidth communication components have provided the foundation for efficient GenAI inference and training in subsequent MTIA chips.

MTIA 300 comprises one compute chiplet, two network chiplets, and several HBM stacks. Each compute chiplet comprises a grid of processing elements (PEs), with some redundant PEs to improve yield. Each PE contains:

Two RISC-V vector cores.
Dot Product Engine for matrix multiplication.
Special Function Unit for activations and elementwise operations.
Reduction Engine for accumulation and inter-PE communication.
DMA engine for data movement in and out of local scratch memory.

Please refer to our ISCA’25 paper for more details on the aforementioned PE components.
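The package composition described above can be summarized as a small data model. This is only a sketch of the hierarchy as the article states it: the class and field names are invented for illustration, and the grid dimensions, redundant-PE count, and HBM stack count used in the example are hypothetical, since the article does not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingElement:
    """One PE, with the components listed in the article."""
    riscv_vector_cores: int = 2
    units: tuple = (
        "Dot Product Engine",     # matrix multiplication
        "Special Function Unit",  # activations and elementwise operations
        "Reduction Engine",       # accumulation and inter-PE communication
        "DMA engine",             # data movement to and from local scratch memory
    )


@dataclass(frozen=True)
class ComputeChiplet:
    """A grid of PEs, some of which are spares kept for yield."""
    rows: int            # hypothetical: grid size is not given in the article
    cols: int            # hypothetical
    redundant_pes: int   # hypothetical: the article only says "some"

    @property
    def physical_pes(self):
        return self.rows * self.cols

    @property
    def usable_pes(self):
        return self.physical_pes - self.redundant_pes

    def tolerates(self, defective_pes):
        """A die is still usable if the spares cover every defective PE."""
        return defective_pes <= self.redundant_pes


@dataclass(frozen=True)
class Mtia300Package:
    compute: ComputeChiplet
    hbm_stacks: int            # "several" in the article; exact count not stated
    compute_chiplets: int = 1  # stated in the article
    network_chiplets: int = 2  # stated in the article


if __name__ == "__main__":
    # All numeric arguments below are placeholders, not real MTIA 300 values.
    chiplet = ComputeChiplet(rows=4, cols=4, redundant_pes=2)
    package = Mtia300Package(compute=chiplet, hbm_stacks=4)
    print(package.compute.usable_pes, "usable PEs out of", chiplet.physical_pes)
    print("Survives 2 defective PEs:", chiplet.tolerates(2))
    print("Survives 3 defective PEs:", chiplet.tolerates(3))
```

Modelling the spares explicitly shows why redundancy helps yield: a die with a few defective PEs can still expose the full advertised PE count.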

MTIA 400: Competitive Raw Performance

As GenAI took off, we evolved MTIA 300 into MTIA 400 to better support GenAI workloads in addition to R&R workloads. MTIA 400 is a major improvement over MTIA 300, with 400% higher FP8 FLOPS and 51% higher HBM bandwidth. While MTIA 300 is a cost-effective product, MTIA 400 is the first MTIA chip designed to deliver not only cost savings but also raw performance competitive with leading commercial products. It combines two compute chiplets to double compute density, and also supports enhanced versions of MX8 and MX4, which are important low-precision formats for efficient GenAI inference. A rack with...
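This excerpt does not describe how MTIA's enhanced MX8 and MX4 formats work. As general background, microscaling (MX) style formats store a small block of values as narrow elements plus one shared scale for the block. The sketch below is a simplified illustration of that idea using signed integer elements and a power-of-two shared scale. The function names, the integer element encoding, and the example values are assumptions for illustration and do not describe MTIA hardware.

```python
import math


def quantize_block(values, bits):
    """Quantize one block to signed integer codes that share a power-of-two scale.

    Returns (exponent, codes), where each value is approximated by code * 2**exponent.
    """
    qmax = 2 ** (bits - 1) - 1            # largest code magnitude, e.g. 7 for 4 bits
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps the largest value within range.
    exponent = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exponent
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return exponent, codes


def dequantize_block(exponent, codes):
    """Reconstruct approximate values from the shared exponent and the codes."""
    scale = 2.0 ** exponent
    return [c * scale for c in codes]


if __name__ == "__main__":
    block = [0.5, -1.0, 0.25, 3.5]        # illustrative values only
    for bits in (8, 4):
        exponent, codes = quantize_block(block, bits)
        restored = dequantize_block(exponent, codes)
        worst = max(abs(a - b) for a, b in zip(block, restored))
        print(f"{bits}-bit: exponent={exponent} codes={codes} max_error={worst}")
```

Narrower elements halve the bytes moved per value at the cost of a coarser grid within each block, which is the trade-off that makes such formats attractive for bandwidth-bound inference.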

