Amazon apologises to customers impacted by huge AWS outage

Amazon Web Services (AWS) has apologised to customers affected by Monday's massive outage, which took some of the world's largest platforms offline.

Snapchat, Reddit and Lloyds Bank were among more than 1,000 sites and services reported to have stopped working following problems at the cloud computing giant's operations centre in Northern Virginia, USA, on 20 October.

In detailing the reasons for the outage, Amazon said it was the result of errors that stopped its internal systems from matching website names with the IP addresses that computers use to find them.

“We apologize for the impact this event has had on our customers,” the company said.

“We know how important our services are to our customers, their applications, end users and their businesses.

“We know this event has had a significant impact on many customers.”

While many platforms, such as online games Roblox and Fortnite, were back up and running within hours of the outage, some services faced extended downtime.

These included Lloyds Bank, which still had some customers experiencing problems shortly before midday, as well as US payment app Venmo and social media site Reddit.

The outage had far-reaching consequences – reportedly even disrupting the sleep of some smart bed owners.

Eight Sleep, which makes internet-connected sleep “pods” with adjustable temperature and elevation settings, said it would work to “crash-proof” its mattresses after some overheated or became stuck in an inclined position.

Many experts said the outage showed how dependent technology has become on a handful of cloud computing providers, with the market largely controlled by AWS and Microsoft Azure.

The company said it would also “do everything possible” to learn from the event and improve its availability.

In its lengthy report on Monday's outage, Amazon said the problem came down to US-EAST-1, its largest cluster of data centres, which underpins large parts of the internet.

Critical processes in the region's database system, which stores and manages the Domain Name System (DNS) records that let computers translate website addresses into numerical IP addresses, had effectively fallen out of sync.
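As a rough illustration of what those records do, the short Python sketch below performs an ordinary DNS lookup using the standard library; the hostname is a placeholder, not a system named in Amazon's report.

    # Illustrative only: a DNS lookup turns a human-readable name into
    # the numeric IP address computers actually connect to.
    import socket

    hostname = "example.com"  # placeholder domain for this sketch
    ip_address = socket.gethostbyname(hostname)  # query DNS for the address
    print(f"{hostname} resolves to {ip_address}")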

According to Amazon, that loss of sync triggered a “hidden race condition” – in other words, it exposed a latent bug that only appears when events happen in an unlikely sequence or timing.

A delay in one of these processes early on Monday morning, Amazon said, had the knock-on effect of stopping its systems from working properly.

Much of this process is automated, meaning it is performed without human intervention.
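To make the idea concrete, here is a heavily simplified Python sketch of a race condition of the kind Amazon described: two automated workers apply updates to a shared DNS record, and an unusual delay lets a stale update overwrite a newer one. All names and values are invented for illustration; this is not Amazon's actual code.

    import threading
    import time

    current_record = {"version": 0, "ip": "192.0.2.1"}  # shared DNS record
    lock = threading.Lock()

    def apply_plan(version, ip, delay):
        plan = (version, ip)   # the worker picks up its update plan...
        time.sleep(delay)      # ...then stalls (the "unlikely" delay)
        with lock:
            # Bug: no check that a newer plan was already applied, so a
            # delayed worker overwrites fresher data with stale data.
            current_record["version"] = plan[0]
            current_record["ip"] = plan[1]

    slow = threading.Thread(target=apply_plan, args=(1, "192.0.2.10", 0.2))
    fast = threading.Thread(target=apply_plan, args=(2, "192.0.2.20", 0.0))
    slow.start(); fast.start()
    slow.join(); fast.join()

    print(current_record)  # ends at version 1: the stale plan won on timing

The standard defence is for each worker to compare versions before writing, so a delayed worker discards its out-of-date plan rather than applying it.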

Dr Junade Ali, a software engineer and fellow of the Institution of Engineering and Technology (IET), told the BBC that Amazon's problems were rooted in “faulty automation”.

“The specific technical reason is that faulty automation disrupted the internal 'address book' system in the region,” he said.

“So they couldn't find any of the other key systems.”

Like others, Dr Ali believes this highlights the need for companies to build in more resilience and diversify their cloud providers “so they can switch to other data centres and providers when one is unavailable”.

“In this case, those who had a single point of failure in this Amazon region could be taken offline,” he said.
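As a rough sketch of the kind of failover Dr Ali describes, the Python snippet below tries a list of redundant endpoints in turn. The addresses are invented for illustration, and a real multi-cloud deployment would also need its data replicated across providers.

    # Illustrative only: try a primary region first, then fall back to
    # another region and another provider. All URLs are made up.
    import urllib.request
    import urllib.error

    ENDPOINTS = [
        "https://service.us-east-1.example.com/health",    # primary region
        "https://service.eu-west-1.example.com/health",    # fallback region
        "https://service.other-cloud.example.net/health",  # other provider
    ]

    def fetch_with_failover(endpoints, timeout=3):
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()   # first healthy endpoint wins
            except (urllib.error.URLError, OSError):
                continue                 # endpoint down, try the next one
        raise RuntimeError("all endpoints unavailable")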
