November 08, 2017

Famous Cloud Outages

Learning from failure is painful (especially if you're the one failing) but also very informative.  I'm compiling a list of interesting cloud outages to learn from and improve.  (May or may not have something to do with my own team's recent outage :P)

(I'm slowly updating this)

AWS - 2017 S3 Outage - Inadvertent removal of important servers in Northern Virginia
An engineer trying to debug a slow billing system inadvertently removed a larger set of servers than intended, including servers supporting two important S3 subsystems.  One of them, the index subsystem, managed the metadata and location information of all S3 objects in that region (Northern Virginia) and was needed to serve every GET, LIST, PUT, and DELETE request.  The other, the placement subsystem, managed allocation of new storage and depended on the index subsystem.  Both subsystems required a full restart, during which S3 was unable to service requests.  Other AWS services in the region that relied on S3 for storage were also impacted while the S3 APIs were unavailable.
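
To make the dependency concrete, here's a toy sketch (the names and structure are my own invention, not AWS's actual design) of why losing the index subsystem takes down both the read path and the placement path:

```python
class IndexSubsystem:
    """Toy stand-in for a metadata/location index. Hypothetical, not AWS's design."""
    def __init__(self):
        self.available = True
        self.locations = {}   # object key -> storage location

    def lookup(self, key):
        if not self.available:
            raise RuntimeError("index unavailable")
        return self.locations.get(key)

    def record(self, key, location):
        if not self.available:
            raise RuntimeError("index unavailable")
        self.locations[key] = location


class PlacementSubsystem:
    """Allocates storage for new objects; depends on the index."""
    def __init__(self, index):
        self.index = index

    def allocate(self, key):
        location = f"server-{hash(key) % 8}"   # toy placement decision
        self.index.record(key, location)       # fails if the index is down
        return location


index = IndexSubsystem()
placement = PlacementSubsystem(index)

placement.allocate("photos/cat.jpg")
print(index.lookup("photos/cat.jpg"))     # GET path works while the index is up

index.available = False                   # index capacity removed
try:
    placement.allocate("photos/dog.jpg")  # PUTs fail too: placement depends on the index
except RuntimeError as e:
    print("request failed:", e)
```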

Amazon noted that while the S3 subsystems were designed to tolerate losing some capacity, they weren't prepared for the index and placement subsystems losing this much capacity at once, which is what forced the full restart.

Improvements to address this outage included (1) building more safeguards into the debugging tool so capacity is removed more slowly and never below a safe minimum, and (2) recovering services more quickly from failures by splitting them into smaller units, which also makes large systems easier to test and debug.
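
As a rough illustration of the first improvement, here's a hedged sketch of what a safeguard in a capacity-removal tool might look like; the thresholds and function are made up, not Amazon's actual tooling:

```python
# Hypothetical safeguard for a capacity-removal tool; illustrative only.
MIN_FLEET_SIZE = 10          # never shrink a fleet below this
MAX_REMOVAL_FRACTION = 0.05  # never remove more than 5% in one operation

def remove_capacity(fleet, servers_to_remove):
    """Remove servers from a fleet, but refuse obviously dangerous requests."""
    remaining = len(fleet) - len(servers_to_remove)
    if remaining < MIN_FLEET_SIZE:
        raise ValueError(
            f"refusing: would leave {remaining} servers, minimum is {MIN_FLEET_SIZE}"
        )
    if len(servers_to_remove) > MAX_REMOVAL_FRACTION * len(fleet):
        raise ValueError("refusing: removing too much capacity in a single operation")
    return [s for s in fleet if s not in set(servers_to_remove)]

fleet = [f"server-{i}" for i in range(100)]
fleet = remove_capacity(fleet, ["server-0", "server-1"])   # OK: small, safe removal
# remove_capacity(fleet, fleet[:50])                       # would raise: too much at once
```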

Azure - 2017 Outage - Overheated data center in North Europe caused sudden shutdown of machines
This one is dear to my heart because I remember this happening at work.

Due to human error, one of Azure's data centers overheated, which caused some servers and storage systems to shut down abruptly.  Many dependent resources failed as a result - virtual machines were shut down to prevent data corruption; HDInsight, Azure Scheduler, Azure Functions, and Azure Stream Analytics dropped jobs; and Azure Monitor and Data Factory saw increased errors and latency in their pipelines.

Microsoft pointed out that customers who'd deployed to availability sets wouldn't have been affected by the outage.
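
Here's a toy model (my own illustration, not Azure's implementation) of why an availability set helps: its VMs are spread across fault domains, so a single domain losing power or cooling only takes out a fraction of them.

```python
# Toy model of an availability set spread across fault domains; illustrative only.
NUM_FAULT_DOMAINS = 3

def place_vms(vm_names):
    """Round-robin VMs across fault domains, the way an availability set spreads them."""
    return {vm: i % NUM_FAULT_DOMAINS for i, vm in enumerate(vm_names)}

def surviving_vms(placement, failed_domain):
    """VMs still running after one fault domain (rack/power/cooling unit) goes down."""
    return [vm for vm, domain in placement.items() if domain != failed_domain]

placement = place_vms(["web-0", "web-1", "web-2", "web-3"])
print(surviving_vms(placement, failed_domain=0))  # two of the four VMs survive
```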

AWS - 2012 EBS Outage - Memory leak triggered by a DNS propagation failure overwhelmed Elastic Block Store (EBS) storage servers
Each EBS storage server periodically contacts a set of data collection servers to report data.  The data is important but not time sensitive, so the system is tolerant of late or missing reports.  One of the data collection servers failed and had to be replaced, and as part of the replacement a DNS record was updated to remove the failed server and add its replacement.  However, the DNS update didn't propagate to all of the internal DNS servers, so a fraction of the storage servers never got the new address and kept trying to contact the data collection server that had been taken out.  Because the data collection service was tolerant of missing data, this didn't raise any alarms.  But the inability to reach a data collection server triggered a memory leak: rather than gracefully handling the failed connection, the storage servers kept retrying and slowly consumed memory.  The monitoring system failed to catch the leak, and eventually the affected storage servers ran low enough on memory that they could no longer keep up with requests.
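
The core failure mode is an unbounded retry path that accumulates state.  Below is a minimal sketch of the anti-pattern and one easy mitigation; it's entirely my own illustration and not the EBS code:

```python
import socket

# Anti-pattern: every failed attempt queues the payload again and never gives up,
# so memory grows for as long as the stale DNS name is unreachable.
class LeakyReporter:
    def __init__(self, collector_host):
        self.collector_host = collector_host
        self.pending = []                    # grows without bound on failure

    def report(self, payload):
        self.pending.append(payload)
        try:
            socket.create_connection((self.collector_host, 9999), timeout=1).close()
            self.pending.clear()
        except OSError:
            pass                             # keep everything and retry forever

# Safer: cap the buffer and shed old data, since this telemetry is explicitly
# tolerant of late/missing data anyway.
class BoundedReporter(LeakyReporter):
    MAX_PENDING = 1000

    def report(self, payload):
        if len(self.pending) >= self.MAX_PENDING:
            self.pending.pop(0)              # drop the oldest report instead of leaking
        super().report(payload)

# Usage (hypothetical hostname of the decommissioned collector):
# reporter = BoundedReporter("decommissioned-collector.internal")
# reporter.report({"disk": "ok"})
```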

The number of stuck EBS volumes increased quickly.  The system began to fail over from unhealthy to healthy servers, but because so many servers failed at the same time, it couldn't find enough healthy servers to fail over to, and eventually a large number of volumes in the Availability Zone were stuck.

This throttled the EC2 and EBS APIs, affected the availability of some RDS databases, and hindered traffic routing for some Elastic Load Balancers (ELBs).

Amazon made changes to propagate DNS updates more reliably and to monitor and alert more aggressively.  They deployed resource limits to prevent low-priority processes from consuming excess resources on EBS storage servers.  They also relaxed the API throttling policies, improved the failover logic of RDS databases (in particular the Multi Availability Zone databases, which were designed to handle this), and improved the reliability of the load balancers by adding more IP capacity, reducing the interdependency between EBS and ELB, and improving the traffic-shifting logic so that traffic is rerouted away from a degraded Availability Zone more quickly.
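
The resource-limits fix can be approximated on a single Linux host with setrlimit; here's a hedged sketch of the idea (my own example, not what AWS actually deployed):

```python
import resource
import subprocess

# Hypothetical illustration: cap the address space of a low-priority reporting
# process so a memory leak in it can't starve the storage server it runs beside.
MEMORY_LIMIT_BYTES = 256 * 1024 * 1024   # 256 MiB

def limit_memory():
    # Runs in the child between fork and exec (Unix only); allocations beyond
    # the cap fail inside the child instead of consuming the whole machine.
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT_BYTES, MEMORY_LIMIT_BYTES))

subprocess.Popen(
    ["python3", "low_priority_reporter.py"],   # hypothetical reporting job
    preexec_fn=limit_memory,                   # apply the cap before the job starts
)
```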