
Update: FR-PAR Kapsule API incident & response


Incidents • Thibault Genaitay • 01/02/24

On January 24 at 23:48 Central European Time (CET), Scaleway encountered an incident in the FR-PAR region that impacted customers using the managed Kubernetes services Kapsule and Kosmos, in both mutualized and dedicated environments, as well as some other products such as Cockpit.

It was resolved by 01:03 the same night. Here’s an overview of what happened.

Timeline of the Incident

23:48: The Scaleway Kubernetes API ran into an Out of Memory (OOM) situation. The incident began, affecting the Kubernetes API for the FR-PAR region: nodes were unable to authenticate with their control plane, and some of them became NotReady

23:48: Increased load detected on the API Gateway, with requests spiking significantly

00:00:10: Incident escalated internally and handled by Scaleway engineers on call

00:00 - 00:30: Diagnostic ongoing: OOM situation identified as the root cause of the outage

00:35 - 00:40: Measures were taken to increase the Kubernetes API's RAM and launch additional replicas

00:43: Remediation worked, API Gateway requests decreasing fast

00:54: Kubernetes nodes were coming back up, and the heavy queue of pending actions on instances was being monitored

01:03: All FR-PAR clusters began to stabilize with the implementation of further measures like purging the registry cache and monitoring for any further issues

01:13: Metrics on Cockpit began to reappear, indicating full recovery.

Post-incident: Continuous monitoring and adjustments were made to ensure stability, including adjustments to service configurations.
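The elapsed times implied by the timeline can be worked out with a short calculation. The sketch below uses only timestamps quoted above; the year (2024) and the rollover to January 25 after midnight are assumptions inferred from the publication date, and the event labels are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps come from the timeline above (CET). The year and the date
# rollover past midnight are assumptions inferred from the publication date.
start = datetime(2024, 1, 24, 23, 48)
events = {
    "remediation effective": datetime(2024, 1, 25, 0, 43),
    "clusters stabilizing": datetime(2024, 1, 25, 1, 3),
    "Cockpit metrics back": datetime(2024, 1, 25, 1, 13),
}


def minutes_since_start(moment: datetime) -> int:
    """Whole minutes elapsed between the start of the incident and `moment`."""
    return int((moment - start) / timedelta(minutes=1))


elapsed = {label: minutes_since_start(t) for label, t in events.items()}
for label, minutes in elapsed.items():
    print(f"{label}: +{minutes} min")
```

Under these assumptions, remediation took effect 55 minutes after the start, clusters began to stabilize after 75 minutes, and Cockpit metrics reappeared after 85 minutes.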

Impact

Increased API Gateway response times, resulting in some unavailability (requests reaching timeouts)

Kubernetes nodes were unable to authenticate and thus temporarily entered the 'NotReady' state, leading to service disruption

Where relevant, the cluster auto-healing process was activated, triggering automatic node replacement.

Root Cause and Resolution of the Issue

On Kubernetes Side

The incident originated from an Out of Memory (OOM) scenario within the Scaleway Kubernetes API, triggered by simultaneous deployments in FR-PAR. The situation was further complicated because this API was also handling authentication at the time. This prevented kubelets from updating their leases, which caused nodes to become NotReady and retry indefinitely.
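To illustrate why an authentication outage turns into NotReady nodes, here is a minimal simulation of lease-based liveness. It is a simplified model, not the Kubernetes implementation: the NodeLease class, the helper names, the 10-second renewal interval and the 40-second grace period are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeLease:
    """Hypothetical model of a node heartbeat lease (not the Kubernetes API object)."""
    node: str
    last_renewed: float  # seconds on an arbitrary clock


def node_status(lease: NodeLease, now: float, grace_period: float) -> str:
    """Report "Ready" while the lease was renewed within the grace period."""
    return "Ready" if now - lease.last_renewed <= grace_period else "NotReady"


def try_renew(lease: NodeLease, now: float, auth_available: bool) -> bool:
    """Renew the lease only if authentication succeeds; otherwise it goes stale."""
    if not auth_available:
        return False
    lease.last_renewed = now
    return True


GRACE = 40.0  # illustrative grace period, not taken from the report
lease = NodeLease("node-a", last_renewed=0.0)
history = []
for now in range(0, 130, 10):  # one renewal attempt every 10 simulated seconds
    auth_up = not (20 <= now < 90)  # simulated authentication outage
    try_renew(lease, float(now), auth_up)
    history.append((now, node_status(lease, float(now), GRACE)))

for moment, status in history:
    print(moment, status)
```

In this model the node stays Ready for a while after authentication fails, flips to NotReady once the last successful renewal is older than the grace period, and recovers on the first renewal that succeeds.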

Resolution:

The following corrective actions were taken for remediation:

Technical Adjustments: Immediate increase in the memory limit and garbage collection thresholds for the Kubernetes managed services.

Scaling: Deployment of additional replicas for key services to handle increased load.

Long-term Solutions:

We are planning to implement the following safeguards:

Implementation of a local authentication cache

Further developments on node authentication.
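The report does not describe how the planned local authentication cache will work. As a rough sketch of the general idea only, the example below puts a time-limited cache in front of a remote verification call and falls back to the last known verdict when that call fails. The AuthCache class, the verify_remote callable, the TTL value and the choice to serve stale entries during an outage are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class AuthCache:
    """Hypothetical TTL cache placed in front of a remote authentication call."""

    def __init__(
        self,
        verify_remote: Callable[[str], bool],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._verify_remote = verify_remote
        self._ttl = ttl
        self._clock = clock
        # token -> (verdict, time at which the verdict was obtained)
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def is_authorized(self, token: str) -> bool:
        now = self._clock()
        cached = self._entries.get(token)
        # Fresh entry: answer locally without touching the remote service.
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]
        try:
            verdict = self._verify_remote(token)
        except ConnectionError:
            # Remote unavailable: reuse the last known verdict if there is one
            # (an assumed policy), and deny tokens that were never verified.
            return cached[0] if cached is not None else False
        self._entries[token] = (verdict, now)
        return verdict
```

Serving stale verdicts keeps already-known nodes working through an outage at the cost of delaying revocations, so a real design would have to bound how long stale entries remain acceptable.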

On API Gateway Side

The API Gateway experienced a significant increase in load, primarily due to the overwhelming number of requests originating from the malfunctioning components on the Kubernetes side.

Despite the increased load, the API Gateway managed to operate without a complete outage. Adjustments already implemented due to a past incident allowed us to better manage sudden spikes in requests.
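One common way to keep failing clients from flooding a gateway is capped exponential backoff with random jitter on the client side. The report does not say that this technique was used here; the sketch below is a generic illustration, and the function names, the base and cap values, and the use of ConnectionError as the retryable failure are assumptions.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(
    base: float, cap: float, attempts: int, rng: Optional[random.Random] = None
) -> Iterator[float]:
    """Yield "full jitter" delays: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


def call_with_backoff(
    operation: Callable[[], T],
    base: float = 0.5,
    cap: float = 30.0,
    attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `operation`, waiting a jittered, growing delay after each failure."""
    delays = list(backoff_delays(base, cap, attempts, rng))
    last_error: Optional[Exception] = None
    for index, delay in enumerate(delays):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
            if index < len(delays) - 1:  # no point sleeping after the final try
                sleep(delay)
    raise RuntimeError("gave up after repeated failures") from last_error
```

The jitter spreads retries from many clients over time, and the attempt limit means a client eventually gives up instead of retrying indefinitely.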

Conclusion

This incident underscores the importance of robust monitoring and rapid response mechanisms in managing unexpected system behaviors. We are committed to learning from this incident and have already implemented several improvements to prevent such occurrences in the future.

These measures include enhancing our monitoring capabilities, adjusting our rate limiting strategies, and improving our incident response protocols. We apologize for any inconvenience caused and are grateful for your understanding and support as we continue to enhance our systems to serve you better.

Scaleway provides real-time status updates for all of its services on its status page. Feel free to contact us via the console with any questions going forward. Thank you!