AWS outage update: What happened yesterday and why

UPDATE Tuesday, 9:30 a.m. ET: While Amazon's AWS services were fully restored by Tuesday, the effects of the massive outage are still becoming apparent.

Problems with one service caused major disruptions to the basic things that make our lives function. Canvas crashed, disrupting learning across the country. Lloyds Bank customers lost access to their accounts. Some United Airlines flyers were unable to check in or view their bookings. People's alarms didn't go off. There are too many examples to list. It was a complete disaster.

For some, Monday was an example of big tech companies being too big. If a single AWS outage can cause such widespread disruption, that concentration is itself a risk.

“If a company can destroy the entire Internet, it is too big. Period,” wrote Democratic Sen. Elizabeth Warren on X. “It’s time to break up Big Tech.”


UPDATE Monday, 8:20 p.m. ET: Amazon provided more information on the recovery of its AWS services and noted: “By 15:01 [PT, or 6:01 p.m. ET], all AWS services returned to normal operations. Some services such as AWS Config, Redshift, and Connect continue to have a backlog of messages that they will finish processing over the next few hours. We will share a detailed AWS post-event summary.”

UPDATE Monday, 5:05 p.m. ET: The latest updates from Amazon indicated its AWS services were progressing toward full resolution.

“Service recovery across all AWS services continues to improve,” the company wrote. It noted it was continuing to “reduce throttles” on certain affected tools.

UPDATE Monday, 3:41 p.m. ET: Amazon indicated its AWS services were well on the way to fully recovering.

“We continue to observe recovery across all AWS services,” the company wrote. It did note customers may still face “intermittent function errors” with Lambda, its serverless compute service.

AWS saw a major outage in the early hours of Monday morning, a temporary recovery, and then further issues as the East Coast neared midday. You can read the full explanation of the outages in both the original story and our regular updates to this article, but, in short, any problem with AWS means major issues for large swaths of the internet. Sites and services such as United Airlines, Snapchat, McDonald's, Verizon, Venmo, and countless others all saw spikes in user-reported issues on Downdetector.

While the internet is vast, there are a few pillars, AWS perhaps chief among them, that can lead to large, disruptive downstream effects should they experience problems.

UPDATE Monday, 3:01 p.m. ET: Amazon said its continued efforts to remedy issues with its AWS services appeared to be working, noting it saw “decreasing networking connectivity issues,” in its most recent update on its status page.

Users still reported a relatively high number of issues with AWS on Downdetector, though many third-party services apparently affected by the AWS outage appeared to be recovering.

It's been a tremendously turbulent Monday for AWS. The popular cloud platform saw a major outage in the early morning hours, briefly recovered, and then experienced new problems around midday.

(Disclosure: Downdetector is owned by Ziff Davis, the same parent company as Mashable.)

UPDATE Monday, 2:15 p.m. ET: Amazon said its efforts to fix its connectivity issues appear to be working. Its widely popular AWS cloud platform suffered renewed issues starting around midday, just hours after a major outage during the early hours of Monday morning.

The company wrote its “mitigations to resolve launch failures” were progressing and that it expected “launch errors and network connectivity issues to subside” as it worked to apply fixes more widely.

UPDATE Monday, 1:15 p.m. ET: Amazon wrote it was working to fix connectivity issues that arose midday Monday ET, hours after a major outage in the early hours of the day.


“We continue to implement mitigations for network load balancer health and connectivity restoration for most AWS services,” reads the latest update on the AWS status page.

Mike Chapple, professor of information technology at the University of Notre Dame, said further problems arising after the initial failure are not necessarily unexpected.

“Although this is devastating, it is not unusual. The process of fixing a major problem in an IT infrastructure often creates new problems, and fixes often have to be distributed to a large number of systems over time,” Chapple said in a statement emailed to Mashable. “While engineers are working on the stability of the system, its operation is gradually stabilizing and everything is returning to normal. Think of it like a power outage that happens in a big city. The power may turn on and off several times while maintenance crews do their work. We’re seeing something similar now with AWS.”


UPDATE Monday, 12:15 p.m. ET: Amazon said it was zeroing in on the underlying issue behind Monday's renewed AWS problems.

“We have narrowed down the source of network connectivity issues that were impacting AWS services,” reads the latest update on the AWS status page. “The main reason is the internal subsystem responsible for monitoring the health of our network load balancers.”

It was not yet clear when the glitches and problems would be completely resolved.

UPDATE Monday, 11:45 a.m. ET: Amazon confirmed that AWS was experiencing new problems late Monday morning, just hours after the issue had apparently been resolved. The company wrote that it was investigating “the root cause of network connectivity issues affecting AWS services such as DynamoDB, SQS, and Amazon Connect” in the latest update on its AWS status page.

Meanwhile, widespread internet service outages continued. User-reported problems rose sharply across a number of popular services, according to Downdetector data, including FanDuel, Snapchat, Apple Music, Asana, Verizon, and many more. The renewed AWS problems were serious and again disrupted service for a large number of users.

A service outage at Amazon Web Services (AWS), Amazon's popular cloud hosting and data service, has caused huge problems for internet users starting their work week on Monday. Since AWS powers a huge portion of the Internet, the list of services and sites that were affected by outages on Monday was quite staggering.

According to user-reported issues on Downdetector, affected services included United Airlines, AT&T, Fortnite, Disney+, HBO Max, Signal, Snapchat, McDonald's, Verizon, Venmo, and many more. (Disclosure: Downdetector is owned by Ziff Davis, the same parent company as Mashable.) Amazon services such as Prime and Alexa were also affected. In short: nearly everyone could have been affected in one way or another.

Nearly everything is connected to the internet, from our refrigerators to Wi-Fi-enabled billboards, which means an AWS outage can disrupt the lives of many people.

By midday, it appeared that the problem had been resolved. But then, according to Amazon's AWS Health Dashboard, problems resurfaced.

“We have confirmed that several AWS services are experiencing network connectivity issues in the US-EAST-1 region,” it said in a message around 10:30 a.m. ET. “We are seeing early signs of connectivity issues returning and are continuing to investigate the root cause.”

It looks like AWS is experiencing problems again, although not on the same scale as in previous hours. Some services, such as Venmo and Boost Mobile, saw a corresponding increase in user-reported issues on Downdetector.

Amazon previously said that the problem was either fully resolved or in the process of being resolved. Mashable reached out for comment and was directed to the AWS Health Dashboard. At approximately 6:35 a.m. ET, the dashboard indicated that the underlying issue had been resolved, although issues could persist as services came back online. That may account for the new problems that surfaced.

“The underlying DNS issue has been completely resolved and most AWS service operations are now running normally,” the update said at 6:35 a.m. ET. “Some requests may be throttled while we work toward full resolution.”

What caused the AWS outage?

The exact cause of AWS's initial failure remains unknown, but we have a rough idea. Services using AWS were unable to reach DynamoDB, Amazon's database service, because of an issue with the Domain Name System (DNS). DNS effectively resolves website names into IP addresses. So when Amazon wrote on its health dashboard that the DNS issue was “fully resolved,” it meant the underlying problem had been fixed.

“Amazon stored the data securely, but no one else could find it for several hours, causing applications to be temporarily separated from their data,” Mike Chapple, professor of information technology at the University of Notre Dame, told CNN. “It’s as if most of the Internet is suffering from temporary amnesia.”
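For readers curious what “resolving” a name looks like in practice, here is a minimal Python sketch. It is an illustration only, not a reconstruction of what happened inside AWS: the `resolve` helper and both hostnames are assumptions chosen for the example. It shows how a failed DNS lookup leaves a client with no address to connect to, even if the server behind the name is running normally.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the IP addresses DNS reports for hostname, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name could not be resolved. The server may be perfectly
        # healthy, but the client has no address to connect to.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as a string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hypothetical hostnames, used for illustration only.
    for name in ("example.com", "nonexistent.invalid"):
        addresses = resolve(name)
        print(name, "->", addresses or "lookup failed")
```

The `.invalid` top-level domain is reserved and should never resolve, so the second lookup stands in for what an application sees when DNS for a service it depends on stops answering.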

Rafe Pilling, director of threat intelligence at cybersecurity company Sophos, told The Guardian that the incident did not appear to be a cyberattack or anything nefarious, which is consistent with Amazon's statements.

“When something like this happens, it's understandable to worry that it's a cyber incident,” he told the British publication. “AWS has a broad and complex footprint, so any issue can cause significant problems.”

Amazon will likely explain later what happened on Monday. It's unclear how the “network connectivity issues” at 10:35 a.m. ET relate (if at all) to the initial DNS issue, although it seems reasonable to assume that problems may arise as services begin to return to normal.

Why is an AWS failure such a big deal?

In short: AWS is a central pillar of the modern internet. Without it, things fall apart. As large companies have eaten up market share, the internet's infrastructure has effectively become extremely fragile: a problem with AWS, or Google, or Microsoft, or CrowdStrike means tons of problems for users.

Advocates even argue that this dependence on these big players is a free speech issue.

“We urgently need diversification in cloud computing,” said Dr. Corinne Cath-Speth, head of digital at human rights organization Article 19, according to The Guardian. “The infrastructure that underpins democratic discourse, independent journalism and secure communications cannot depend on a handful of companies.”

In short: if something goes wrong with AWS, many other things go wrong.