WritingScalewayScalewaypublished Oct 18, 2024seen 5d

Update: Scaleway Elements Partial Loss of VPC Connectivity in FR-PAR-1

Open original ↗

Captured source

source ↗
published Oct 18, 2024seen 5dcaptured 3dhttp 200method plain

Update: Scaleway Elements Partial Loss of VPC Connectivity in FR-PAR-1 Incidents • Pavel Lunin • 18/10/24 • 4 min read

On Friday, October 18, 2024, Scaleway Elements cloud ecosystem experienced a network incident that caused a cascade of events in various products in the FR-PAR-1 Availability Zone. The incident started at 6:20 UTC and had two main impact periods:

From 6:20 to 6:21 UTC : around a minute of instability for a very limited number of instances hosted in a single rack. Some customers might have experienced heavy packet loss for all types of network connectivity for a subset of their instances.

From 7:50 to 8:01 UTC : unavailability of VPC network connectivity for a subset of around 25% of hypervisors of FR-PAR-1 AZ and subsequent impact for VPC dependent products. Public internet connectivity was not impacted during this period. Elastic Metal VPC connectivity was not impacted.

Scaleway Elements Infrastructure

The Scaleway Elements cloud ecosystem is built on top of stacked layers of infrastructure:

Data centers: electrical power systems, cooling systems, optical fiber cabling and physical security

Hardware and physical network infrastructure: servers, data center fabric networks, backbone routers, and inter-datacenter links

Virtualized multi-tenant cloud foundation (Virtual machines running on top of hypervisors + Virtualized software defined network, providing multi-tenant VPC networks and VPC edge services, such as DHCP and DNS within VPC)

High-level PaaS products: K8S, Database, Load Balancer, Serverless, Observability and many more, running on top of VM instances and using VPC networks for internal communication.

These layers run on top of each other: the higher layers are dependent on the lower layers of the infrastructure.

In a high-load massive scale environment, consisting of many thousands of physical machines, potential hardware failures are routine events happening several times per week. The vast majority of them are invisible to the customers as we build our infrastructure in a redundant fashion. All of the layers have their own redundancy and failover mechanisms which make them resilient to the failures in the lower layers. All the critical systems have at least two instances (often more), deployed in an active-active fashion with a 50% load threshold, so if a critical instance fails, the remaining capacity is able to handle 100% of the load.

Timeline

At 6:20 UTC, one of the Top of Rack (ToR) switches experienced a software crash. Due to the unstable state of its software modules, it took around 50 seconds for the network protocols to failover all traffic of the rack to the second ToR switch

During the convergence time, the traffic to and from the hypervisors in the impacted rack experienced a high percentage of packet drops of a non-deterministic nature

By 6:21 UTC, all traffic was fully rerouted to the backup device and the instability was resolved

At 6:38 UTC, the crashed switch terminated its automatic reboot and restored normal operation of the redundant ToR mode

However, this instability caused a cascade effect on one of the infrastructure blocks of the virtualized network infrastructure: a BGP route reflector (RR), used for VPC services. This software was hosted on one of the hypervisors in the impacted rack

The RR software stack experienced an instability and got stuck in an abnormal state which could not be resolved by the auto-restart process

At this point, customers didn’t experience any impact of the VPC services as the backup RR was operating normally. However the RR redundancy was lost

At 7:50 UTC, the second RR experienced a critical condition and also got stuck in a non-operational state

At this point customers experienced a disruption of the VPC connectivity for a subset of FR-PAR-1 AZ Scaleway Elements virtual products. Around 25% of the hypervisors lost VPC connectivity with the rest of the region

Both RR software stacks have health-check monitoring and auto-restart mechanism which should have addressed this type of failure. The health-check monitoring successfully detected the anomaly however the auto-restart mechanism failed

The impact was detected by Scaleway Customer Excellence and SRE teams, and an incident was opened with subsequent notification of the technical top-management

Both VPC route reflectors were fixed by a manual action

By 8:01 UTC VPC connectivity was fully restored

By 8:07 UTC the situation was back to nominal state with redundancy operating normally.

VPC Dependent Products Impact

Managed Database and Redis

Impacted during both periods:

Short period of connectivity loss of the impacted rack (6:20-6:21 UTC). Some Database customers experienced 500 HTTP errors for connections to their databases

VPC connectivity loss for 25% of FR-PAR-1 hypervisors (7:50-8:01 UTC). Impacted customers could not use VPC to connect to their managed databases.

Serverless Functions and Containers

During the VPC connectivity loss period (7:50-8:01 UTC) one of the serverless compute nodes was unavailable. A subset of customers with workloads on this node could experience service disruptions, including potential 500 errors when calling their functions/container while their workloads were being rescheduled.

Kapsule

During the VPC connectivity loss period (7:50-8:01 UTC), there was a network partition between nodes, preventing applications running on different nodes to communicate.

The infrastructure hosting control-planes uses VPC, and was impacted too, causing some control-plane unavailability. Unfortunately, some nodes were replaced by the autohealing because they were unable to report their availability, causing workloads to be rescheduled/restarted.

By 8:04 UTC almost all clusters were recovered.

Lessons Learned and Further Actions

We have fixed the autohealing mechanism which failed to fix the route reflectors in a stale state: https://github.com/FRRouting/frr/pull/17163

We are planning to introduce software version diversity for route reflectors to avoid multiple instances being impacted by a single bug

We plan to investigate in-depth the software issue which caused that stale state.

Scaleway provides updates in real time of all of its services’ status, here . Feel free to contact us via the console for any questions moving forwards. Thank you!

Notability

notability 1.0/10

Routine infrastructure incident update, not AI-lab related.