Scaleway • published Sep 18, 2024

Update: Details of “Important loss of connectivity in VPC on fr-par region” incident & response


Incidents • Alexis Bauvin • 18/09/24 • 7 min read

On September 1st, 2024, at 08:27 UTC, Scaleway experienced a major VPC incident affecting the fr-par-1 and fr-par-2 availability zones. The impact ranged from general networking instability to DNS and DHCP failures. It was resolved by 12:46 UTC the same day.

This is a story of snowball effects, and of how BGP (Border Gateway Protocol), a protocol built for reliability, can let you down after all.

A primer on Scaleway's infrastructure

Our infrastructure is built on a set of standard technologies, using off-the-shelf software. However, we don't use any turnkey solutions. This includes our virtual networking stack.

We communicated about it a while back: the core of a Scaleway AZ (Availability Zone) is an IP fabric, built using VXLAN and BGP-EVPN. VPC is simply another overlay network atop the same IP fabric underlay, also leveraging VXLAN and BGP-EVPN. There is, however, one key difference: VPC does BGP to the hypervisor.

[Figure: BGP 1.png]

Basic VPC architecture. Note that the IP fabric itself has its own set of route-reflectors, and runs BGP between the leaves, the spines and the backbone. It’s BGP all the way down.

BGP is used to propagate layer 2 reachability information from one hypervisor to the other. When you send a packet from an instance to another one, it is thanks to BGP that the hypervisor knows where to send it.
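As a rough illustration of what "layer 2 reachability information" means for a hypervisor, here is a minimal sketch in Python. It is not Scaleway's implementation: the `MacRoute` and `HypervisorTable` names, the fields, and the sample addresses are all assumptions made for the example. It only shows the idea that BGP updates fill a table mapping a network and a MAC address to the hypervisor (VXLAN tunnel endpoint) that hosts the instance.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class MacRoute:
    """Simplified stand-in for an EVPN MAC advertisement (hypothetical shape)."""
    vni: int    # VXLAN network identifier of the private network
    mac: str    # MAC address of the instance's interface
    vtep: str   # IP address of the hypervisor hosting that instance


class HypervisorTable:
    """Reachability table a hypervisor could build from received BGP updates."""

    def __init__(self) -> None:
        self._routes: Dict[Tuple[int, str], str] = {}

    def on_update(self, route: MacRoute) -> None:
        # A new or changed route: remember where this MAC lives.
        self._routes[(route.vni, route.mac)] = route.vtep

    def on_withdraw(self, route: MacRoute) -> None:
        # The route is gone: the MAC is no longer reachable through BGP.
        self._routes.pop((route.vni, route.mac), None)

    def next_hop(self, vni: int, dst_mac: str) -> Optional[str]:
        # Where to send a packet for dst_mac, or None if unknown.
        return self._routes.get((vni, dst_mac))


table = HypervisorTable()
route = MacRoute(vni=4242, mac="02:00:00:aa:bb:cc", vtep="10.0.0.7")
table.on_update(route)
print(table.next_hop(4242, "02:00:00:aa:bb:cc"))  # 10.0.0.7
table.on_withdraw(route)
print(table.next_hop(4242, "02:00:00:aa:bb:cc"))  # None
```

The second lookup returning `None` is the interesting case: once a route is withdrawn, the hypervisor no longer knows where to send traffic for that instance until the route is advertised again.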

Something already surfaces in the above schematic: VPC's route-reflectors have an order of magnitude more BGP peers than those of the IP fabric. A single leaf pair can hold tens of hypervisors. This obviously leads to scaling and reliability issues on those route-reflectors, which would need to handle thousands of BGP sessions in large zones:

Having that much responsibility makes the route-reflector highly critical. Too critical. Its blast radius in case of failure is too high.

Performance problems can quickly arise, as a single route needs to be propagated to all the other peers. Basically, every byte in can lead to kilobytes out.

The latter was the main technical hurdle that led us to shard our hypervisors into groups, called "fabrics", to dedicate a route-reflector to each fabric, and to make those route-reflectors communicate. This way, with two fabrics, each route-reflector only handles half of the hypervisors, while still sharing routes.
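The "every byte in can lead to kilobytes out" effect, and what sharding does to it, can be put into numbers with a back-of-the-envelope sketch. The model below is an assumption-laden simplification: one reflector per fabric (real deployments use redundant pairs), hypervisors split evenly, no update packing, and made-up figures for update size and zone size.

```python
from typing import Tuple


def reflected_bytes(update_bytes: int, peers: int) -> int:
    """Bytes a single route-reflector sends when one peer announces a route.

    The update is copied to every peer except the one it came from.
    """
    return update_bytes * (peers - 1)


def sharded_bytes(update_bytes: int, hypervisors: int, shards: int) -> Tuple[int, int]:
    """Bytes sent per reflector once hypervisors are split into shards.

    Returns (origin, remote):
    - origin: the reflector whose client announced the route; it copies the
      update to its other clients and to the other reflectors.
    - remote: any other reflector; it copies the update to all of its clients.
    """
    clients_per_shard = hypervisors // shards
    origin = update_bytes * ((clients_per_shard - 1) + (shards - 1))
    remote = update_bytes * clients_per_shard
    return origin, remote


# Illustrative numbers only: a 100-byte update in a zone of 1000 hypervisors.
print(reflected_bytes(100, 1000))   # 99900
print(sharded_bytes(100, 1000, 2))  # (50000, 50000)
```

The total number of bytes leaving the reflectors stays in the same ballpark, since every hypervisor still needs the route. What changes is the load on each individual reflector, which is roughly divided by the number of shards.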

Each shard can (and does) scale to several hundred hypervisors. Please note that from this point onwards, we won’t talk anymore about the IP fabric.

And then, even two pairs were not enough. We added a third pair. We added a fourth pair, continuing the mesh between the route-reflector clusters. Ultimately, we had to make a CLOS topology with our route-reflectors when fr-par-1 grew its 6th route-reflector cluster, in order to avoid further amplification. We now have "spine" and "leaf" route-reflectors.
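To see why a mesh between route-reflector clusters stops scaling, it helps to count sessions. The sketch below treats each cluster as a single node and assumes a hypothetical number of spine reflectors (2 here); the actual number of spines is not stated in this article.

```python
def mesh_peers_per_cluster(clusters: int) -> int:
    """In a full mesh, each cluster peers with every other cluster."""
    return clusters - 1


def mesh_sessions(clusters: int) -> int:
    """Total inter-cluster sessions in a full mesh: one per pair of clusters."""
    return clusters * (clusters - 1) // 2


def clos_sessions(leaf_clusters: int, spines: int) -> int:
    """Total sessions in a two-tier layout: each leaf peers with every spine."""
    return leaf_clusters * spines


SPINES = 2  # assumption for illustration only

for n in (2, 4, 6, 10):
    print(n, mesh_peers_per_cluster(n), mesh_sessions(n), clos_sessions(n, SPINES))
```

With 6 clusters, a full mesh means each cluster copies every local route to 5 other clusters, for 15 sessions overall. In the two-tier layout, under the assumption above, each leaf cluster only sends it to 2 spines, and that per-leaf fan-out no longer grows as clusters are added.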

fr-par-1 is now pretty large, and we have CLOS at every level. As before, each shard can host hundreds of hypervisors, the biggest reaching 800 at its peak.

We'll come back to the hand-waved "rest of the region" part of the schematic.

A detour on BGP software stacks

For a while, we’ve used multiple BGP implementations, both open-source and proprietary. We started out VPC using open-source BGP software on the hypervisors and proprietary BGP virtual appliances for the route-reflectors; these appliances are basically a backbone router’s stack in a VM. This choice was driven by the robustness of this BGP implementation: it powers large swaths of the internet on the vendor's platforms, it is well known and battle-tested, and it already powers our IP fabric. Yet we quickly found out it was not robust enough for our use case.

Enter fr-par-1. Our oldest, biggest availability zone. A few years ago, when we hit the 500 hypervisors mark there, we had issues. The zone already had a somewhat large amount of VPC routes, but nothing too dramatic. If it were not for the 500 BGP sessions.

We started to see flaps (1) left and right.

The vendor’s BGP stack and TCP stack could not keep up: peers had to wait too long for information to propagate, gave up, and reset the session to start from scratch.

Restart from scratch? Well, yes. Tearing down a session and starting from scratch is BGP's main (if not only) error handling mechanism. One of the cases where a BGP speaker will reset the session is when its peer does not respect the BGP protocol. And handling messages fast enough is one part of the protocol: peers send periodic keepalive messages, and expect an answer in a timely manner. Speakers also expect peers to read their incoming message queue often enough.

In BGP terms, this is the "BGP hold timer", and its main purpose is to detect crashed peers, and to not waste time with them.
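The hold timer logic can be sketched in a few lines of Python. This is a simplified model, not the code of any BGP implementation: the function names are made up, and the 90-second hold time with keepalives every 30 seconds are the values suggested in RFC 4271, not necessarily what we run. Per the RFC, the two speakers use the smaller of the hold times they each proposed, and the timer restarts every time a KEEPALIVE or UPDATE from the peer is processed.

```python
from typing import List, Optional


def negotiated_hold_time(local: float, remote: float) -> float:
    """Both speakers use the smaller of the two proposed hold times."""
    return min(local, remote)


def first_expiry(processed_at: List[float], hold_time: float) -> Optional[float]:
    """Return the time at which the hold timer fires, or None if it never does.

    processed_at lists the times (in seconds since the session came up) at
    which a message from the peer was actually processed. The timer starts
    at t=0 and restarts on every processed message. Nothing is checked past
    the last listed message.
    """
    deadline = hold_time
    for t in sorted(processed_at):
        if t > deadline:
            return deadline  # nothing was processed in time: session reset
        deadline = t + hold_time
    return None


hold = negotiated_hold_time(90.0, 180.0)

# Healthy peer: a keepalive is processed every 30 seconds.
print(first_expiry([30, 60, 90, 120], hold))  # None

# Overloaded speaker: the third message is only processed at t=170.
print(first_expiry([30, 65, 170], hold))      # 155.0
```

Note that in the second case the peer may well have sent its keepalives on time. What matters is when they get processed, which is why an overloaded speaker can end up resetting sessions with perfectly healthy peers.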

It turns out our route-reflector was overloaded and could not handle messages quickly enough, letting the hold timers expire left and right, leading to session flaps. This was our first historic, large-scale incident on VPC.

From this day onwards, we knew the appliance could not handle the load, and we sharded it. And gave it massive resources to help it keep up. And when scaling up VPC, we kept looking out for better solutions, and got very interested in Free Range Routing, or FRR. FRR is an open-source suite of networking daemons we already use a lot at Scaleway for its integration with the Linux kernel.

We now use FRR for our route-reflectors, and we made the switch in early 2024 for our new deployments. However, at the time of the incident, we were still running the proprietary software almost everywhere, hoping to do a progressive migration leveraging the natural lifecycle of hypervisors. Only a few small clusters were running FRR, as well as our spine route-reflectors, shown in yellow in the previous schematic.

It all begins with a blip

On Sunday, September the 1st, at 08:26:02 UTC, a few sessions between spine and leaf route-reflectors flapped. While a flap is not a normal event and should be inspected, it should not have any impact. And at first, these flaps had none, as they were between FRR route-reflectors.

But it generated a lot of BGP traffic, due to the withdrawals immediately followed by the updates of the whole RIB (3).

A minute later, at 08:27:41 UTC, chaos ensued. The influx of noise reached the biggest appliance-based cluster, with 800 hypervisors. Its sessions with spine route-reflectors went down. In less than 5…
