Scaling Kubernetes to 2,500 nodes
Captured source
source ↗Scaling Kubernetes to 2,500 nodes | OpenAI
January 18, 2018
Scaling Kubernetes to 2,500 nodes
Loading…
Share
We’ve been running Kubernetes for deep learning research for over two years. While our largest-scale workloads manage bare cloud VMs directly, Kubernetes provides a fast iteration cycle, reasonable scalability, and a lack of boilerplate which makes it ideal for most of our experiments. We now operate several Kubernetes clusters (some in the cloud and some on physical hardware), the largest of which we’ve pushed to over 2,500 nodes. This cluster runs in Azure on a combination of D15v2 and NC24 VMs.
On the path to this scale, many system components caused breakages, including etcd, the Kube masters, Docker image pulls, network, KubeDNS, and even our machines’ ARP caches. We felt it’d be helpful to share the specific issues we ran into, and how we solved them.
etcd
After passing 500 nodes in our cluster, our researchers started reporting regular timeouts from the kubectl command line tool. We tried adding more Kube masters (VMs running kube-apiserver). This seemed to solve the problem temporarily, but once we passed 10 replicas we knew we were treating symptoms and not the cause (by comparison, GKE uses a single 32-core VM for 500 nodes).
This made us strongly suspect our etcd cluster, which is the central store of state for the Kube masters. Looking in Datadog, we saw write latency spiking to hundreds of milliseconds on the DS15v2 machines running our etcd replicas, despite each machine using a P30 SSD capable of 5,000 IOPS.
Benchmarking performance with fio, we saw etcd was only able to use about 10% of the available IOPS because the write latency was 2ms and etcd does sequential I/O, making it latency-bound.
We then moved the etcd directory for each node to the local temp disk, which is an SSD connected directly to the instance rather than a network-attached one. Switching to the local disk brought write latency to 200us, and etcd became healthy!
Our cluster ran well until we passed about 1,000 nodes, at which point we once again saw high commit latency from etcd. This time, we noticed the kube-apiservers were reading more than 500MB/s from etcd. We set up Prometheus to monitor the apiservers, and also set the--audit-log-path and--audit-log-maxbackup flags to enabled more logging on the apiserver. This surfaced a number of slow queries and excessive calls to the LIST API for Events.
The root cause: the default setting for Fluentd’s and Datadog’s monitoring processes was to query the apiservers from every node in the cluster (for example, this issue which is now fixed). We simply changed these processes to be less aggressive with their polling, and load on the apiservers became stable again:
Another helpful tweak was storing Kubernetes Events in a separate etcd cluster, so that spikes in Event creation wouldn’t affect performance of the main etcd instances. To do this, we just set the--etcd-servers-overrides flag to something like this:--etcd-servers-overrides=/events#https://0.example.com:2381;https://1.example.com:2381;https://2.example.com:2381
Another post-1,000 nodes failure was to hit etcd’s hard storage limit (by default 2GB), which causes it to stop accepting writes. This triggered a cascading failure: all our Kube nodes failed their health checks, and our autoscaler decided it thus needed to terminate all the workers. We’ve increased the max etcd size with the--quota-backend-bytes flag, and the autoscaler now has a sanity check not to take action if it would terminate more than 50% of the cluster.
Kube masters
We colocate the kube-apiserver, kube-controller-manager, and kube-scheduler processes on the same machines. For high availability, we always have at least 2 masters, and set the--apiserver-count flag to the number of apiservers we’re running (otherwise Prometheus monitoring can get confused between instances).
We use Kubernetes mainly as a batch scheduling system and rely on our autoscaler to dynamically scale up and down our cluster — this lets us significantly reduce costs for idle nodes, while still providing low latency while iterating rapidly. The default kube-scheduler policy is to spread out load evenly among nodes, but we want the opposite so that unused nodes can be terminated and also so that large pods can be scheduled quickly. So we switched to the following policy:
Plain Text
1{2"kind" : "Policy",3"apiVersion" : "v1",4"predicates" : [5 {"name" : "GeneralPredicates"},6 {"name" : "MatchInterPodAffinity"},7 {"name" : "NoDiskConflict"},8 {"name" : "NoVolumeZoneConflict"},9 {"name" : "PodToleratesNodeTaints"}10 ],11"priorities" : [12 {"name" : "MostRequestedPriority", "weight" : 1},13 {"name" : "InterPodAffinityPriority", "weight" : 2}14 ]15}We use KubeDNS extensively for service discovery, but soon after rolling out the new scheduling policy it started having reliability issues. We found that the failures were only happening on certain pods of KubeDNS. With the new scheduling policy some machines ended up running 10+ copies of KubeDNS, creating hotspots, and we had exceeded the ~200QPS that’s allowed from each Azure VM for external domains lookups.
We fixed this by adding an anti-affinity rule to our KubeDNS pods:
Plain Text
1affinity:2 podAntiAffinity:3 requiredDuringSchedulingIgnoredDuringExecution:4 - weight: 1005 labelSelector:6 matchExpressions:7 - key: k8s-app8 operator: In9 values:10 - kube-dns11 topologyKey: kubernetes.io/hostname
Docker image pulls
Our Dota project started out on Kubernetes, and as it scaled, we noticed that fresh Kubernetes nodes often have pods sitting in Pending for a long time. The game image is around 17GB, and would often take 30 minutes to pull on a fresh cluster node, so we understood why the Dota container would be Pending for a while — but this was true for other containers as well. Digging in, we found that kubelet has a--serialize-image-pulls flag which defaults totrue, meaning the Dota image pull blocked all other images. Changing tofalse required switching Docker to overlay2 rather than AUFS. To further speed up pulls, we also moved the Docker root to the instance-attached SSD, like we did for the etcd machines.
Even after optimizing the pull speed, we saw pods failing to start with a cryptic error message:rpc error: code = 2 desc = net/http: request canceled. The kubelet and Docker logs also…
Excerpt shown — open the source for the full document.