WritingOpenAIOpenAIpublished May 5, 2026seen 6d

Unlocking large scale AI training networks with MRC (Multipath Reliable Connection)

Open original ↗

Captured source

source ↗

Supercomputer networking to accelerate large scale AI training | OpenAI

May 5, 2026

Supercomputer networking to accelerate large scale AI training

Loading…

Share

Frontier model training depends on reliable supercomputer networks that can quickly move data between GPUs. To make this faster and more efficient, OpenAI has partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop MRC (Multipath Reliable Connection): a novel protocol that improves GPU networking performance and resilience in large training clusters. We released MRC today⁠ through the Open Compute Project (OCP) to enable the broader industry to use it.

With more than 900M people using ChatGPT every week, our systems are becoming core infrastructure for AI, helping people and businesses around the world build with increasingly capable models. Prior to the inception of Stargate⁠, we co-developed, brought up, and maintained our first three generations of supercomputers with great care and close collaboration with our partners over the span of a few years. This invaluable experience informed our strong belief that, to efficiently use compute at the scale of Stargate and succeed in our mission, we need to rethink and drastically reduce complexity in every layer of the stack – including network design.

Publishing the MRC specification is part of OpenAI’s overall compute strategy: shared standards in key infrastructure layers can help scale AI systems more efficiently, reliably, and across a broader partner ecosystem. In this post, we’ll cover the design of MRC, including: i) how it enables us to build multi-plane high-speed networks to create redundancy to ride out network failures, while using fewer components and less power ii) how MRC’s adaptive packet spraying virtually eliminates core congestion and iii) how our deployments use static source routing to bypass failures and eliminate whole classes of routing failure. In concert, these benefits allow us to deliver better models to everyone faster.

Why networks needed a new design

When training large AI models, a single step can involve many millions of data transfers. One transfer arriving late can ripple through the entire job, potentially causing GPUs to sit idle. Network congestion, link, and device failures are the most common sources of delay and jitter in transfers.

These problems get more frequent, and harder to solve, as the size of the cluster increases. This makes networking technology a key part of the design of Stargate.

To enable the current scale of Stargate supercomputers, we faced two key networking challenges. First, whenever possible, we should minimize the possibility of network congestion. There are unavoidable bottlenecks, such as two GPUs sending to the same destination at the same time. But outside of these cases, we should avoid congestion through design.

Second, we need to minimize the effect of network failures on the training job itself. At large enough scale, even the best network will have a constant background level of link and switch failures. Previously, a single failure would often cause a training job to crash, forcing a restart from a saved checkpoint, or stall progress for many seconds while the network recomputed routes. Such interruptions are costly in both GPU cycles and time. With synchronous pretraining – where many GPUs across many computers cooperate in lockstep to train one AI model – this is especially true. The larger the job we run, the greater the impact of any single link flap or failure. These workloads act as a form of “failure amplifier,” so preventing this has become critical.

Our answer: MRC

Our goal was not just to build a fast network, but also to build one that delivers very predictable performance, even in the presence of failures, to keep training jobs moving.

To achieve this reliability, our Scaling team worked with AMD⁠, Broadcom⁠, Intel, Microsoft⁠, and NVIDIA⁠ over the past two years to develop a new way to build and operate our networks. The result of this effort is a technique we call Multipath Reliable Connection, or MRC⁠. It’s a new network protocol built into the latest 800Gb/s network interfaces that lets us spread a single transfer across hundreds of paths, route around failures in microseconds, and run simpler network control planes.

MRC extends RDMA over Converged Ethernet (RoCE) – an InfiniBand Trade Association (IBTA) standard that enables hardware-accelerated remote direct memory access among GPUs and CPUs. It draws on techniques developed by the Ultra Ethernet Consortium (UEC) and extends them with SRv6-based source routing to support large-scale AI networking fabrics.

MRC is already deployed across all of OpenAI’s largest NVIDIA GB200 supercomputers that we use to train frontier models, including our site with Oracle Cloud Infrastructure (OCI) in Abilene, Texas, and in Microsoft’s Fairwater supercomputers. MRC has been used to train multiple OpenAI models, leveraging hardware from NVIDIA and Broadcom. Today, the MRC specification is available as an Open Compute Project (OCP) contribution for the community to use and build on. We co-authored a paper detailing our experiences,“Resilient AI Supercomputer Networking using MRC and SRv6”⁠.

The foundation: Multi-plane networks

Building highly resilient networks requires that we start with a network topology that has enough natural redundancy that all flows can get good performance, even when links or switches in the network have failed.

Instead of treating each network interface as one 800Gb/s link, we split it into multiple smaller links. For example, one interface can connect to eight different switches. You can then build eight separate parallel networks, or planes, each operating at 100Gb/s, rather than a single 800Gb/s network.

That change has a large effect on the shape of the cluster. A switch that can connect 64 ports at 800Gb/s can instead connect 512 ports at 100Gb/s. This lets you build a network fully connecting about 131,000 GPUs with only two tiers of switches. A conventional 800Gb/s network would require three or four tiers.

MRC’s support for multi-plane networks means we can connect over a hundred thousand GPUs with only two tiers of switches. This reduces the power required, the number of components that can fail, and the total cost of the network compared to conventional approaches.

The result is a network that is lower cost, has lower power consumption, and gives us more path-diversity…

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

OpenAI research post, modest HN traction