Amazon services 'recovering' as Snapchat and banks among sites hit by outage

Liv McMahon, Technology reporter, and Lily Jamali, North American Technology Correspondent

Amazon Web Services (AWS) said late Monday that it had resolved a massive outage that took some of the world's largest websites offline for much of the day.

More than 1,000 apps and websites, including social media platforms such as Snapchat and banks such as Lloyds and Halifax, were hit by problems that Amazon said originated in the cloud computing giant's US operations.

Platform outage monitoring company Downdetector said the number of user reports of problems worldwide rose to more than 11 million during Monday's outage.

Even after Amazon fixed the underlying problem, experts said the outage demonstrated the dangers of so many companies relying on one dominant supplier.

“This episode showed how interdependent our infrastructure is,” said Professor Alan Woodward of the University of Surrey.

“Many online services rely on third-party providers for their physical infrastructure, and this shows that even the largest of these third-party providers can have problems.

“Small mistakes, often made by humans, can have widespread and significant consequences.”

The disruption appears to have begun at around 07:00 BST on Monday, when users started reporting problems accessing multiple platforms.

This included a wide range of different sites and services, from massive online games such as Fortnite to the language learning app Duolingo.

Earlier on Monday, Downdetector told the BBC that in just a few hours it had seen more than four million reports from users across 500 sites – more than double the number seen in the entirety of a normal weekday.

They later peaked at more than 11 million, it said, as more services, including Reddit and Lloyds Bank, tried to recover.

At approximately 23:00 Moscow time, Amazon announced that all AWS services had “returned to normal operation.”

But not before the company had to restrict parts of its system to solve the underlying problem.

After the initial outage, a new series of “cascading failures” could have occurred, said Mike Chapple, a professor of information technology at the University of Notre Dame.

“It's like a large-scale power outage. Crews are starting to work to try and get it back online,” Mr Chapple said. “The power may blink a few times,” he explained, but it's possible that Amazon initially “only addressed the symptoms” and not the cause.

What went wrong?

Amazon has not yet provided details on what caused Monday's outage or issued an official statement on the matter.

An update on the service's status web page states that the issue “appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1.”

DNS, which stands for Domain Name System, is often compared to the Internet phone book.

It effectively translates the website names that people use (e.g. bbc.co.uk) into the numerical addresses that computers can read and understand.

This process is essentially at the core of how we use the Internet, and failures can result in web browsers being unable to find the content they are looking for.
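To make the phone book comparison concrete, here is a minimal Python sketch of a DNS lookup using the standard library's `socket.getaddrinfo`. The function name `resolve` and its `lookup` parameter are illustrative assumptions, not anything used by Amazon or described in its status update. It shows only the general idea: a name goes in, numerical addresses come out, and a failed lookup leaves the caller with nowhere to connect.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted, unique IP addresses for hostname, or [] if the lookup fails.

    `lookup` is a hypothetical injection point so the function can be
    exercised without network access; by default it is the real resolver.
    """
    try:
        results = lookup(hostname, None)
    except socket.gaierror:
        # Name resolution failed: the "phone book" returned no usable entry.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as a string.
    return sorted({info[4][0] for info in results})


if __name__ == "__main__":
    # The addresses printed depend on the network and on the time of the query.
    print(resolve("bbc.co.uk"))
```

An empty list here corresponds to the situation described above: the service may still be running, but software that cannot turn its name into an address has no way of reaching it.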

Matthew Prince, chief executive of Cloudflare, told the BBC that the AWS outage highlighted the impact of cloud services on the internet.

“Everyone had a bad day, Amazon had a bad day today,” he said.

“The amazing thing about the cloud is that it allows you to scale… but if you have an outage like that, it can shut down a lot of the services we rely on.”

And Cori Crider, head of the Future of Technology Institute, told the BBC it was “a bit like a bridge collapsing.”

“A significant part of the economy has collapsed,” she said.

With an estimated 70% or so of cloud computing dependent on Amazon, Microsoft and Google, she said the status quo was “unsustainable.”

“If you have concentrated supply with a handful of monopolistic suppliers, when something like this collapses, it takes a huge percentage of the economy with it,” she said.

“We should really think about trying to buy more local services rather than relying on a handful of American monopoly platforms.

“This is a risk to our security, our sovereignty and our economy, and we need to consider structural decoupling to make our markets more resilient to these types of shocks.”

One computer science expert says part of the responsibility lies with the companies using AWS.

“Companies that use Amazon don't pay enough attention to building security into their applications,” says Ken Birman, a computer science professor at Cornell University in New York.

Outages like Monday's happen frequently, although not always on such a large scale.

Birman told the BBC that app developers should take care to invest in backing up mission-critical apps in the cloud.

“We know how to make these systems stronger, and we know how to do it safely,” Birman says.
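One common way of backing up a critical service is to keep a second copy elsewhere and switch to it when the first cannot be reached. The Python sketch below illustrates that general idea only; it is not a method attributed to Birman or to Amazon, and the function name `call_with_failover` and the endpoint names in the usage example are hypothetical.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order and return (endpoint, result) from the first success.

    `endpoints` is an ordered list of names (primary first) and `request`
    is any callable that takes an endpoint and either returns a result or
    raises OSError. Both are illustrative assumptions.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, request(endpoint)
        except OSError as exc:
            # OSError covers DNS failures (socket.gaierror) and connection errors.
            errors.append((endpoint, exc))
    raise RuntimeError(f"all endpoints failed: {errors}")


if __name__ == "__main__":
    def demo_request(endpoint):
        # Simulate a primary region whose name cannot be resolved.
        if endpoint == "primary.example":
            raise OSError("simulated DNS failure")
        return "ok"

    print(call_with_failover(["primary.example", "secondary.example"], demo_request))
```

In practice the harder part is keeping the backup copy's data up to date, which a short sketch like this does not address.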

The issue of liability may well end up in court.

More than a year after CrowdStrike's massive outage, Delta Air Lines is still in dispute with the company over $500 million in damages.

Even after CrowdStrike fixed the problem, the airline said it had to manually reset 40,000 servers, causing severe flight delays of several days.

Additional reporting by Esyllt Carr.