Scaleway • published Sep 18, 2024

Update: Scaleway Object Storage incident across September & October 2024


Incidents • Thomas Gerbier • 08/11/24 • 5 min read

Incident Overview

Between September 24 and October 23, 2024, Scaleway Object Storage went through a period of increased instability in the FR-PAR region following architectural upgrades aimed at enhancing performance and scalability. During this time, some customers experienced elevated error rates and increased upload latency.

Incident status links:

https://status.scaleway.com/incidents/938lgd7wn1zg

https://status.scaleway.com/incidents/b4dw2zl83837

Regions Impacted: FR-PAR

Duration: 7 days until mitigation, 1 month to return to nominal performance

Primary Impact: Instability and increased latency on S3 uploads, affecting certain users.

Summary of Events

In September 2024, we initiated a series of infrastructure improvements for Scaleway Object Storage, including the migration of PUT methods to our new Object Storage gateway and the deployment of an upgraded load-balancing architecture in the FR-PAR region. These changes were designed to reduce latency and improve scalability, as successfully observed in prior rollouts in other regions (NL-AMS and PL-WAW).

However, after scaling up the deployment in the FR-PAR region on September 23, we observed an unexpected increase in 503 errors, indicating instability. Initial analysis showed that the FR-PAR region's higher load conditions made it particularly susceptible to unforeseen issues, despite thorough monitoring during earlier deployments. The migration could not be reverted due to the architectural complexity of the update, leading to a longer mitigation delay than usual.

Incident Timeline

September 18, 2024: Migration of PUT methods to the new Object Storage Gateway completed in FR-PAR

September 23, 2024: New load balancer architecture deployed in FR-PAR to handle increased request volumes

September 25, 2024: Initial incident opened following increased 503 errors. A patch was deployed but did not fully resolve the issue

September 28, 2024: Second incident opened. Another patch deployed but rolled back due to unintended side effects

September 30, 2024: Final mitigation patch deployed, temporarily stabilizing the service but causing a slight increase in latency

October 4, 2024: Partial hardware mitigation implemented in FR-PAR, yielding significant performance improvements

October 7-8, 2024: Additional upgrades were made to the new load balancer servers (from 64GB to 512GB of RAM) to resolve memory-related issues

October 23, 2024: Full deployment of the long-term fix across all impacted regions, restoring performance to optimal levels.

Root Cause Analysis

Increased Load on FR-PAR: The unique conditions in the FR-PAR region, particularly higher request loads, revealed an unexpected sensitivity in our infrastructure that was not observed during earlier regional deployments

Memory Limitations: New load balancer servers in FR-PAR were initially provisioned with 64GB of RAM, which proved insufficient under a sudden increase in traffic, leading to memory exhaustion and early termination of processes

Connection Management: Issues with HTTP Keep-Alive timeout settings between our new Gateway and load balancers led to inefficient handling of some client requests, exacerbating latency issues

Patch and Rollback Challenges: Although multiple patches were quickly developed, early solutions had to be rolled back due to unintended side effects. Also, no rollback was possible for the initial architectural upgrades, prolonging the resolution.
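To illustrate the connection-management point above, here is a minimal sketch of one common class of keep-alive misconfiguration. It is not necessarily the one that occurred here, since the exact settings are not given. It assumes the load balancer sits in front of the gateway and reuses idle connections to it. In that layout, the proxy side should generally give up on an idle connection before the backend closes it, otherwise a request can be sent on a connection that has just been closed. The class name, field names, and timeout values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeepAliveConfig:
    """Hypothetical idle-timeout settings, in seconds, for one proxy hop."""
    proxy_idle_timeout: float         # how long the load balancer keeps an idle backend connection
    backend_keepalive_timeout: float  # how long the gateway keeps an idle connection open


def reuse_is_safe(cfg: KeepAliveConfig, margin: float = 1.0) -> bool:
    """Return True if the proxy drops idle connections before the backend does.

    The margin (an assumed value) leaves room for clock and network delays.
    """
    return cfg.proxy_idle_timeout + margin <= cfg.backend_keepalive_timeout


if __name__ == "__main__":
    # Illustrative values only, not taken from the incident.
    print(reuse_is_safe(KeepAliveConfig(proxy_idle_timeout=55, backend_keepalive_timeout=60)))
    print(reuse_is_safe(KeepAliveConfig(proxy_idle_timeout=60, backend_keepalive_timeout=60)))
```

A check of this kind can run in configuration review or CI so that mismatched timeouts are caught before a deployment reaches a high-traffic region.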

Impact on Customers

During this period, customers in the FR-PAR region may have observed:

Elevated 503 errors and occasional request failures, particularly during peak hours

Increased latency on object uploads, with temporary performance degradation.

Customers were advised to implement retries on failed requests to mitigate the impact while further optimizations were rolled out.
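As an illustration of the retry advice, here is a minimal client-side sketch using capped exponential backoff with jitter. It is not Scaleway's recommended implementation. `TransientError`, `put_with_retries`, and the `put_object` callable are hypothetical names standing in for whatever your S3 client raises and calls on a 503. The attempt count and delays are assumed values to tune for your workload.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a retryable failure such as HTTP 503."""


def backoff_delays(max_attempts, base=0.5, cap=8.0):
    """Yield one capped exponential delay bound (in seconds) per retry."""
    for attempt in range(max_attempts - 1):
        yield min(cap, base * (2 ** attempt))


def put_with_retries(put_object, max_attempts=5, base=0.5, cap=8.0, sleep=time.sleep):
    """Call put_object() until it succeeds or the attempts run out.

    put_object is a caller-supplied function (an assumption of this sketch)
    that performs the upload and raises TransientError on a retryable failure.
    """
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            return put_object()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Full jitter: wait a random time up to the exponential bound
            # so that many clients do not retry in lockstep.
            sleep(random.uniform(0, next(delays)))
```

Retrying is reasonable for object uploads because a repeated PUT of the same key and body overwrites the object rather than duplicating it. Many S3 SDKs also ship built-in retry settings, which are usually preferable to hand-rolled loops when available.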

Resolution and Improvements

The resolution involved:

Memory Upgrades: New load balancer servers in FR-PAR were upgraded from 64GB to 512GB of RAM, significantly improving stability under high loads

Enhanced Connection Management: HTTP Keep-Alive settings were fine-tuned between the Object Storage gateway and load balancers, which improved response times and connection stability

Improved Fault Tolerance: A new upload mechanism was developed, enhancing the fault tolerance of PUT operations, particularly in handling intermittent errors.

These improvements culminated in a full resolution of the incident on October 23, 2024. A performance gain was confirmed in FR-PAR compared with the period before the architectural upgrades that had triggered the incident. Customer feedback quickly confirmed satisfaction with the overall optimization of the service.

Customer Support and SLAs

Despite this incident, we maintained overall SLA compliance for September (99.93% uptime against a 99.0% SLA target for single-zone and 99.90% for multi-AZ configurations). The overall October SLA was not degraded by the incident.
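To put these percentages in perspective, the short sketch below converts them into minutes of downtime over a 30-day month such as September. It assumes uptime is measured as a simple fraction of total minutes in the month, which may differ from how the SLA is actually computed. The function name is illustrative.

```python
def downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for label, pct in [("measured", 99.93), ("multi-AZ target", 99.90), ("single-zone target", 99.0)]:
        print(f"{label}: {pct}% uptime -> {downtime_minutes(pct):.2f} min")
```

Under that assumption, 99.93% uptime corresponds to roughly 30 minutes of unavailability in the month, against budgets of about 43 minutes for the 99.90% target and 432 minutes for the 99.0% target.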

Next Steps and Continuous Improvement

This incident has highlighted areas where we can enhance both our infrastructure and our processes. As part of our commitment to continuous improvement, we are:

Strengthening our monitoring and alerting to detect similar issues earlier in the deployment cycle

Implementing a more robust change-management process to improve rollback options for complex architectural upgrades

Exploring advanced deployment methods, including blue-green deployments

Improving external communication ahead of production deployments that may have an impact (maintenance windows).

We remain dedicated to delivering reliable and performant Object Storage services to all our customers and thank you for your understanding as we continue to make improvements based on the lessons learned from this incident.

Scaleway provides real-time status updates for all of its services here. Feel free to contact us via the console with any questions going forward. Thank you!
