Scaling Kubernetes to 7,500 nodes
Captured source
source ↗Scaling Kubernetes to 7,500 nodes | OpenAI
January 25, 2021
Scaling Kubernetes to 7,500 nodes
Loading…
Share
We’ve scaled Kubernetes clusters to 7,500 nodes, producing a scalable infrastructure for large models like GPT‑3, CLIP, and DALL·E, but also for rapid small-scale iterative research such as Scaling Laws for Neural Language Models.
Scaling a single Kubernetes cluster to this size is rarely done and requires some special care, but the upside is a simple infrastructure that allows our machine learning research teams to move faster and scale up without changing their code.
Since our last post on scaling to 2,500 nodes we’ve continued to grow our infrastructure to meet researcher needs, in the process learning many additional lessons. This post summarizes those lessons so that others in the Kubernetes community can benefit from them, and ends with problems we still face that we’ll be tackling next.
Loading...
Our workload
Before we get too far, it’s important to describe our workload. The applications and hardware we run with Kubernetes are pretty different from what you may encounter at a typical company. Our problems and corresponding solutions may, or may not, be a good match to your own setup!
A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node. Any NUMA, CPU, or PCIE resource contention aren’t factors for scheduling. Bin-packing or fragmentation is not a common problem. Our current clusters have full bisection bandwidth, so we also don’t make any rack or network topology considerations. All of this means that, while we have many nodes, there’s relatively low strain on the scheduler.
That said, strain on the kube-scheduler is spiky. A new job may consist of many hundreds of pods all being created at once, then return to a relatively low rate of churn.
Loading...
Our biggest jobs run MPI, and all pods within the job are participating in a single MPI communicator. If any of the participating pods die, the entire job halts and needs to be restarted. The job checkpoints regularly, and when restarted it resumes from the last checkpoint. Thus we consider the pods to be semi-stateful—killed pods can be replaced and work can continue, but doing so is disruptive and should be kept to a minimum.
We don’t rely on Kubernetes load balancing all that much. We have very little HTTPS traffic, with no need for A/B testing, blue/green, or canaries. Pods communicate directly with one another on their pod IP addresses with MPI via SSH, not service endpoints. Service “discovery” is limited; we just do a one-time lookup for which pods are participating in MPI at job startup time.
Most jobs interact with some form of blob storage. They usually either stream some shards of a dataset or checkpoint directly from blob storage, or cache it to a fast local ephemeral disk. We have a few PersistentVolumes for cases where POSIX semantics are useful, but blob storage is far more scalable and doesn’t require slow detach/attach operations.
Lastly, the nature of our work is fundamentally research, which means the workloads themselves are ever-changing. While the Supercomputing team strives to provide what we’d consider a “production” quality level of compute infrastructure, the applications that run on that cluster are short-lived and their developers iterate quickly. New usage patterns may emerge at any time that challenge our assumptions about trends and appropriate tradeoffs. We need a sustainable system that also allows us to respond quickly when things change.
Networking
As the number of nodes and pods within our clusters increased, we found that Flannel had difficulties scaling up the throughput required. We switched to using the native pod networking technologies for our IP Configurations for Azure VMSSes and the relevant CNI plugins. This allowed us to get host level network throughput on our pods.
Another reason we’ve switched to using alias-based IP addressing is that on our largest clusters, we could possibly have approximately 200,000 IP addresses in use at any one time. When we tested route-based pod networking, we found there were significant limitations in the number of routes we could effectively use.
Avoiding encapsulation increases the demands on the underlying SDN or routing engine, but it keeps our networking setup simple. Adding VPN or tunneling can be done without any additional adapters. We don’t need to worry about packet fragmentation due to some portion of the network having a lower MTU. Network policies and traffic monitoring is straightforward; there’s no ambiguity about the source and destination of packets.
We use iptables tagging on the host to track network resource usage per Namespace and pod. This lets researchers visualize their network usage patterns. In particular, since a lot of our experiments have distinct Internet and intra-pod communication patterns, it’s often useful to be able to investigate where any bottlenecks might be occurring.
iptablesmangle rules can be used to arbitrarily mark packets that match particular criteria. Here are our rules to detect whether traffic is internal or internet-bound. TheFORWARD rules cover traffic from pods, vsINPUT andOUTPUT traffic from the host:
Plain Text
1iptables -t mangle -A INPUT ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"2iptables -t mangle -A FORWARD ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"3iptables -t mangle -A OUTPUT ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"4iptables -t mangle -A FORWARD ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"
Once marked, iptables will start counters to track the number of bytes and packets that match this rule. You can eyeball these counters by usingiptables itself:
Plain Text
1% iptables -t mangle -L -v2Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)3 pkts bytes target prot opt in out source destination4....51253K 555M all -- any any anywhere !10.0.0.0/8 /* iptables-exporter openai traffic=internet-out */61161K 7937M all -- any any !10.0.0.0/8 anywhere /* iptables-exporter openai…
Excerpt shown — open the source for the full document.
Notability
Scored, but no written rationale attached yet.
OpenAI has a writing signal matching infrastructure.