AWS apologises for 14-hour outage and sets out causes of US datacentre region downtime

Amazon Web Services (AWS) has apologised to customers inconvenienced by the 14-hour outage that hit its largest datacentre region in the United States on October 20, in a blog post detailing the exact nature of the technical difficulties its services were facing.

As Computer Weekly previously reported, the outage occurred in the public cloud giant's US-East-1 datacentre region in Northern Virginia and caused large-scale disruption to numerous companies around the world, including in the UK.

Social media and communications services such as Snapchat and Signal were hit by disruption to their services, as were Amazon-owned internet businesses, including its retail site and its Ring doorbell and Alexa services.

In the UK, financial services provider Lloyds Banking Group, along with its Halifax and Bank of Scotland subsidiaries, and the government tax collection agency HM Revenue and Customs were also affected by the outage.

As a result, HM Treasury is now facing calls to explain why AWS, given its role as the primary cloud provider to the UK financial services sector, has not yet been brought within the scope of the Critical Third Party (CTP) regime.

The initiative gives HM Treasury the power to designate financial services sector providers as CTPs, meaning their activities could be brought under the supervision of various UK financial regulators.

The intention is that this can help better manage any potential risks to the stability and soundness of the UK financial system that could arise from a third-party provider suffering a service outage, as happened to AWS this week.

The company published an extensive summary document following the event, which confirmed that the outage occurred in three separate stages as a result of problems in several parts of its infrastructure.

In the document, AWS reported that shortly before 8am UK time on October 20, its fully managed serverless NoSQL database offering, Amazon DynamoDB, began experiencing an increase in application programming interface (API) errors that lasted just under three hours.
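For readers unfamiliar with how such an issue manifests, an increase in DynamoDB API errors is something calling applications see directly as failed requests. The sketch below is illustrative only, uses a hypothetical table name and key, and simply shows how a boto3 client surfaces those error codes while leaning on the SDK's built-in retries:

```python
# Illustrative only: how a client application calling DynamoDB might observe
# elevated API error rates during an incident. Table name and key are hypothetical.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Ask botocore to retry transient API errors with adaptive backoff.
dynamodb = boto3.client(
    "dynamodb",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 5, "mode": "adaptive"}),
)

def read_order(order_id: str) -> dict | None:
    """Read a single item, surfacing the error code if all retries fail."""
    try:
        response = dynamodb.get_item(
            TableName="orders",                 # hypothetical table
            Key={"order_id": {"S": order_id}},
        )
        return response.get("Item")
    except ClientError as err:
        # During an event like this, callers see error responses from the API
        # far more often than usual; the code identifies the failure type.
        print(f"DynamoDB API error: {err.response['Error']['Code']}")
        return None
```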

Then, from approximately 1pm UK time on October 20, some Network Load Balancers (NLBs) in the US-East-1 region began experiencing increased connection errors that continued until approximately 10pm that same day. “This was caused by health check failures across the NLB fleet, which led to an increase in connection errors,” the summary document said.
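The health checking referred to in the quote is AWS's own monitoring of its NLB fleet, but customers diagnosing connection errors during an event like this would typically start by asking their load balancer what it reports about its registered targets. A minimal sketch of that check, assuming boto3 and a hypothetical target group ARN:

```python
# Illustrative only: how a customer might inspect what an NLB reports about
# the health of its registered targets while diagnosing connection errors.
# The target group ARN below is hypothetical.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/example-tg/0123456789abcdef"
)

def report_target_health() -> None:
    """Print the health state each registered target currently reports."""
    response = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    for desc in response["TargetHealthDescriptions"]:
        target = desc["Target"]["Id"]
        state = desc["TargetHealth"]["State"]        # e.g. healthy, unhealthy
        reason = desc["TargetHealth"].get("Reason", "-")
        print(f"{target}: {state} ({reason})")

if __name__ == "__main__":
    report_target_health()
```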

In addition, AWS reported issues when attempting to launch instances of its Elastic Compute Cloud (EC2) virtual servers, which persisted from approximately 10.30am UK time on October 20 until 6.30pm.

“New EC2 instance launches failed, although launches became successful as of 10:37am PDT [6.37pm UK time]. Some recently launched instances experienced connectivity issues, which were resolved by 1:50pm PDT [9.50pm UK time],” the summary document continued.
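Launch failures of this kind surface to callers of the EC2 RunInstances API as error responses, so automation that creates instances would have had to retry until launches started succeeding again. A minimal retry loop is sketched below, assuming boto3 and hypothetical AMI and instance parameters; the error handling is generic rather than taken from AWS's summary:

```python
# Illustrative only: retrying EC2 instance launches that fail with an API
# error, as automation hit by the launch issues would have had to do.
# The AMI ID and instance type are hypothetical.
import time
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_with_retry(max_attempts: int = 5) -> str | None:
    """Try to launch one instance, backing off after each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",    # hypothetical AMI
                InstanceType="t3.micro",
                MinCount=1,
                MaxCount=1,
            )
            return response["Instances"][0]["InstanceId"]
        except ClientError as err:
            code = err.response["Error"]["Code"]
            print(f"Launch attempt {attempt} failed: {code}")
            time.sleep(2 ** attempt)                # simple exponential backoff
    return None
```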

The document also confirmed that other AWS services hosted in US-East-1 suffered knock-on effects as a result of the issues faced by DynamoDB, EC2 and its Network Load Balancers.

“We are making several changes as a result of this operational event,” the company said. “As we continue to work through the details of this event across all AWS services, we will be looking for additional ways to avoid exposure to a similar event in the future and ways to further reduce recovery time.”

The company concluded the summary document by apologising to all customers affected by the outage.

“While we have extensive experience delivering our services with the highest level of availability, we know how critical our services are to our customers, their applications, end users and their businesses,” the summary document states. “We know this event has had a significant impact on many customers. We will do our best to learn from this event and use it to further improve our availability.”
