WritingOpenAIOpenAIpublished Jan 18, 2018seen 6d

Scaling Kubernetes to 2,500 nodes

Open original ↗

Captured source

source ↗
published Jan 18, 2018seen 6dcaptured 2dhttp 200method exa

Scaling Kubernetes to 2,500 nodes | OpenAI

January 18, 2018

Scaling Kubernetes to 2,500 nodes

Loading…

Share

We’ve been running⁠ Kubernetes⁠ for deep learning research for over two years. While our largest-scale workloads manage bare cloud VMs directly, Kubernetes provides a fast iteration cycle, reasonable scalability, and a lack of boilerplate which makes it ideal for most of our experiments. We now operate several Kubernetes clusters (some in the cloud and some on physical hardware), the largest of which we’ve pushed to over 2,500 nodes. This cluster runs in Azure on a combination of D15v2 and NC24 VMs.

On the path to this scale, many system components caused breakages, including etcd, the Kube masters, Docker image pulls, network, KubeDNS, and even our machines’ ARP caches. We felt it’d be helpful to share the specific issues we ran into, and how we solved them.

etcd

After passing 500 nodes in our cluster, our researchers started reporting regular timeouts from the kubectl⁠ command line tool. We tried adding more Kube masters (VMs running kube-apiserver⁠). This seemed to solve the problem temporarily, but once we passed 10 replicas we knew we were treating symptoms and not the cause (by comparison, GKE⁠ uses a single 32-core VM for 500 nodes).

This made us strongly suspect our etcd⁠ cluster, which is the central store of state for the Kube masters. Looking in Datadog⁠, we saw write latency spiking to hundreds of milliseconds on the DS15v2 machines running our etcd replicas, despite each machine using a P30 SSD capable of 5,000 IOPS.

Benchmarking performance with fio⁠, we saw etcd was only able to use about 10% of the available IOPS because the write latency was 2ms and etcd does sequential I/O, making it latency-bound.

We then moved the etcd directory for each node to the local temp disk, which is an SSD connected directly to the instance rather than a network-attached one. Switching to the local disk brought write latency to 200us, and etcd became healthy!

Our cluster ran well until we passed about 1,000 nodes, at which point we once again saw high commit latency from etcd. This time, we noticed the kube-apiservers were reading more than 500MB/s from etcd. We set up Prometheus⁠ to monitor the apiservers, and also set the--audit-log-path and--audit-log-maxbackup flags to enabled more logging on the apiserver. This surfaced a number of slow queries and excessive calls to the LIST API for Events.

The root cause: the default setting for Fluentd⁠’s and Datadog’s monitoring processes was to query the apiservers from every node in the cluster (for example, this issue⁠ which is now fixed). We simply changed these processes to be less aggressive with their polling, and load on the apiservers became stable again:

Another helpful tweak was storing Kubernetes Events in a separate etcd cluster, so that spikes in Event creation wouldn’t affect performance of the main etcd instances. To do this, we just set the--etcd-servers-overrides flag to something like this:--etcd-servers-overrides=/events#https://0.example.com:2381;­https://1.example.com:2381;­https://2.example.com:2381

Another post-1,000 nodes failure was to hit etcd’s hard storage limit (by default 2GB), which causes it to stop accepting writes. This triggered a cascading failure: all our Kube nodes failed their health checks, and our autoscaler⁠ decided it thus needed to terminate all the workers. We’ve increased the max etcd size with the--quota-backend-bytes flag, and the autoscaler now has a sanity check not to take action if it would terminate more than 50% of the cluster.

Kube masters

We colocate the kube-apiserver, kube-controller-manager⁠, and kube-scheduler⁠ processes on the same machines. For high availability⁠, we always have at least 2 masters, and set the--apiserver-count flag to the number of apiservers we’re running (otherwise Prometheus monitoring can get confused between instances).

We use Kubernetes mainly as a batch scheduling system and rely on our autoscaler⁠ to dynamically scale up and down our cluster — this lets us significantly reduce costs for idle nodes, while still providing low latency while iterating rapidly. The default kube-scheduler policy is to spread out load evenly among nodes, but we want the opposite so that unused nodes can be terminated and also so that large pods⁠ can be scheduled quickly. So we switched to the following policy:

Plain Text

1{2"kind" : "Policy",3"apiVersion" : "v1",4"predicates" : [5 {"name" : "GeneralPredicates"},6 {"name" : "MatchInterPodAffinity"},7 {"name" : "NoDiskConflict"},8 {"name" : "NoVolumeZoneConflict"},9 {"name" : "PodToleratesNodeTaints"}10 ],11"priorities" : [12 {"name" : "MostRequestedPriority", "weight" : 1},13 {"name" : "InterPodAffinityPriority", "weight" : 2}14 ]15}

We use KubeDNS⁠ extensively for service discovery, but soon after rolling out the new scheduling policy it started having reliability issues. We found that the failures were only happening on certain pods of KubeDNS. With the new scheduling policy some machines ended up running 10+ copies of KubeDNS, creating hotspots, and we had exceeded the ~200QPS that’s allowed from each Azure VM for external domains lookups.

We fixed this by adding an anti-affinity rule⁠ to our KubeDNS pods:

Plain Text

1affinity:2 podAntiAffinity:3 requiredDuringSchedulingIgnoredDuringExecution:4 - weight: 1005 labelSelector:6 matchExpressions:7 - key: k8s-app8 operator: In9 values:10 - kube-dns11 topologyKey: kubernetes.io/hostname

Docker image pulls

Our Dota⁠ project started out on Kubernetes, and as it scaled, we noticed that fresh Kubernetes nodes often have pods sitting in Pending⁠ for a long time. The game image is around 17GB, and would often take 30 minutes to pull on a fresh cluster node, so we understood why the Dota container would be Pending for a while — but this was true for other containers as well. Digging in, we found that kubelet⁠ has a--serialize-image-pulls flag which defaults totrue, meaning the Dota image pull blocked all other images. Changing tofalse required switching Docker to overlay2 rather than AUFS. To further speed up pulls, we also moved the Docker root to the instance-attached SSD, like we did for the etcd machines.

Even after optimizing the pull speed, we saw pods failing to start with a cryptic error message:rpc error: code = 2 desc = net/http: request canceled. The kubelet and Docker logs also…

Excerpt shown — open the source for the full document.