A single point of failure triggered the Amazon outage affecting millions

In turn, the delay in network state propagation affected the network load balancer, which AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. Affected AWS network functions included creating and modifying Redshift clusters, Lambda invocations, and Fargate task launches such as Managed Workflows for Apache Airflow, Outposts lifecycle operations, and the AWS Support Center.

Amazon has disabled the DynamoDB DNS Planner and DNS Enactor globally for now while it works to fix the race condition and add protections to prevent incorrect DNS plans from being applied. Engineers are also making changes to EC2 and its network load balancer.
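As a rough illustration of what such a protection could look like, the sketch below refuses to apply a DNS plan that is empty or older than the plan already in place, so a delayed writer cannot overwrite newer state. This is not Amazon's implementation: the `DnsPlan` and `PlanStore` names, the integer generation number, and the empty-plan check are all assumptions made for the example.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsPlan:
    generation: int   # hypothetical, monotonically increasing plan number
    records: tuple    # addresses the endpoint should resolve to


class PlanStore:
    """Holds the applied plan and refuses stale or empty ones (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = None

    def apply(self, plan: DnsPlan) -> bool:
        # The freshness check and the write happen under one lock, so two
        # writers cannot interleave between "is this newer?" and "apply it".
        with self._lock:
            if not plan.records:
                return False  # assumed safeguard: never publish an empty record set
            if self._current is not None and plan.generation <= self._current.generation:
                return False  # stale plan from a delayed writer
            self._current = plan
            return True

    def current(self):
        with self._lock:
            return self._current


store = PlanStore()
store.apply(DnsPlan(2, ("10.0.0.2",)))  # the newer plan lands first
store.apply(DnsPlan(1, ("10.0.0.1",)))  # the delayed older plan is rejected
```

The point of the design is that checking and writing form a single atomic step; a check performed once at the start of a long-running apply would leave the same window open.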

A cautionary tale

Ookla identified a contributing factor not mentioned by Amazon: the concentration of customers that route their connections through the US-East-1 endpoint and their inability to route around the region. Ookla explained:

The affected US‑EAST‑1 is the oldest and most commonly used AWS hub. Regional concentration means that even global applications often tie identity, state, or metadata flows there. When regional dependency fails, as it did in this case, the consequences ripple throughout the world as many “global” stacks pass through Virginia at some point.

Modern applications integrate managed services such as storage, queues, and serverless functions. If DNS cannot reliably resolve a critical endpoint (such as the DynamoDB API involved here), errors cascade through upstream APIs and cause visible failures in applications that users do not associate with AWS. This is exactly what Downdetector has seen in Snapchat, Roblox, Signal, Ring, HMRC and others.

This event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design.
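For readers who want to see what routing around a failed region might look like on the client side, here is a minimal sketch. It tries an ordered list of regions and returns the first successful response. The `call_with_failover` helper, the `AllRegionsFailed` exception, and the simulated `fake_request` function are hypothetical names invented for this example, not part of any AWS SDK.

```python
from typing import Callable, Sequence


class AllRegionsFailed(Exception):
    """Raised when no region in the list could serve the request."""


def call_with_failover(regions: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    errors = {}
    for region in regions:
        try:
            return request(region)
        except OSError as exc:
            # DNS resolution failures (socket.gaierror) and connection
            # errors are both subclasses of OSError.
            errors[region] = exc
    raise AllRegionsFailed(errors)


def fake_request(region: str) -> str:
    # Simulated outage: the first region's endpoint cannot be reached.
    if region == "us-east-1":
        raise ConnectionError("endpoint did not resolve")
    return f"ok from {region}"


result = call_with_failover(["us-east-1", "us-west-2"], fake_request)
print(result)
```

A fallback like this only helps if the data, identity, and metadata the application needs are also available in the second region, which is the dependency concentration Ookla describes.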

“The path forward,” Ookla said, “is not zero failure, but limited failure, achieved through multi-regional design, diversity of dependencies and disciplined incident preparedness, with regulatory oversight that moves toward viewing the cloud as a systemic component of national and economic resilience.”
