Anatomy of an Outage: Deconstructing the AWS DynamoDB Service Disruption

A deep dive into the October 2025 AWS outage, dissecting the root causes in DynamoDB's DNS management and the subsequent cascading failures across multiple services.

[Figure: interconnected AWS services affected by a cascading outage, highlighting a central DNS failure.]
The AWS DynamoDB service disruption in October 2025 served as a stark reminder of the intricate dependencies within cloud infrastructure and the critical role of robust DNS management.

In the digital realm, us-east-1 is hallowed ground. It is the original Amazon Web Services region, the bedrock upon which a significant portion of the modern internet is built. When it falters, the digital world holds its breath. On October 19th and 20th, 2025, it did more than falter; it fractured, revealing the intricate, and at times terrifying, interconnectedness of the systems we depend on. The nearly 15-hour disruption was not a simple hardware failure or a botched deployment. It was a textbook case of cascading failure, a slow-motion implosion triggered by a single, latent bug in a core service's Domain Name System (DNS) management.

The official AWS post-mortem, while detailed, reads like a dispatch from a battlefield, documenting the fall of one bastion after another. But for the architects and engineers who build upon this foundation, it is more than a report; it is a blueprint of systemic risk. To understand what happened, we must look beyond the sequence of events and analyze the anatomy of the failure itself—a story of race conditions, congestive collapse, and the deadly paradox of automated resilience.

The First Domino: A Race to Zero in DNS

The incident began in the digital heart of AWS: DynamoDB, the fully managed NoSQL database service that underpins countless AWS services and customer applications. At 11:48 PM PDT, DynamoDB's API endpoints in us-east-1 began to vanish from the internet. The cause was not a network partition or a server crash, but something far more subtle: an empty DNS record.

DynamoDB, like other hyper-scale services, relies on a sophisticated automated system to manage its DNS records. This is not merely about pointing a domain name to an IP address. It's a dynamic process of load balancing, health checking, and capacity management across a vast fleet of servers. The system, designed for resilience, is split into two components: a DNS Planner and a DNS Enactor.

  1. The DNS Planner: This component monitors the health of the DynamoDB fleet and generates "plans"—sets of weighted IP addresses for the service's various endpoints (e.g., dynamodb.us-east-1.amazonaws.com).
  2. The DNS Enactor: This component, designed with minimal dependencies for recovery scenarios, takes these plans and applies them to Amazon Route 53, the DNS service. For redundancy, multiple Enactor instances run independently in different Availability Zones (AZs).

The failure was born from a race condition between these redundant Enactors. The sequence of events, while unlikely, is a masterclass in distributed systems failure modes (a toy code reconstruction follows the list):

  1. Delay: An Enactor instance (Enactor A) picked up an older DNS plan but was significantly delayed in applying it.
  2. Progress: While Enactor A was stalled, the Planner generated several newer plans. A second instance, Enactor B, picked up the newest plan and successfully applied it to all endpoints.
  3. Cleanup: After successfully applying its new plan, Enactor B initiated a cleanup process, deleting plans that were now several generations old.
  4. The Race: Just as this cleanup began, the delayed Enactor A finally succeeded in applying its much older plan to the main regional endpoint, overwriting the correct, new record. A staleness check that should have prevented this failed because of the extreme delay between when the check was performed and when the update was applied.
  5. Deletion: The cleanup process from Enactor B, now running, saw the old plan that Enactor A had just applied. Recognizing it as obsolete, it promptly deleted it.
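
The danger lies in the gap between the staleness check and the write, combined with a cleanup that assumes nothing still points at old plans. The toy Python reconstruction below is purely illustrative: the class names, data shapes, and single-endpoint model are assumptions of this post, not AWS's actual Planner/Enactor implementation. It only demonstrates how a non-atomic check-then-act plus an aggressive delete can leave a record empty.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    generation: int    # monotonically increasing plan number from the Planner
    ips: list[str]     # endpoint IPs, with weights omitted for simplicity


@dataclass
class DnsTable:
    """Stand-in for the authoritative DNS state (Route 53 in the real system)."""
    applied_generation: int = -1
    record: list[str] = field(default_factory=list)

    def apply(self, plan: Plan) -> None:
        self.applied_generation = plan.generation
        self.record = list(plan.ips)

    def delete_records_for(self, generation: int) -> None:
        # In this toy model, deleting a plan also deletes any record it produced.
        if self.applied_generation == generation:
            self.record = []


class Enactor:
    """Applies plans to the DNS table; multiple instances run independently."""

    def __init__(self, name: str, dns: DnsTable) -> None:
        self.name, self.dns = name, dns

    def plan_is_fresh(self, plan: Plan) -> bool:
        # Staleness check: the plan must be at least as new as what is applied.
        return plan.generation >= self.dns.applied_generation

    def apply(self, plan: Plan) -> None:
        self.dns.apply(plan)

    def clean_up(self, older_than: int) -> None:
        # Delete plans (and their records) that are several generations old.
        for generation in range(older_than):
            self.dns.delete_records_for(generation)


dns = DnsTable()
old_plan = Plan(generation=3, ips=["10.0.0.1", "10.0.0.2"])
new_plan = Plan(generation=7, ips=["10.0.1.1", "10.0.1.2"])
enactor_a, enactor_b = Enactor("A", dns), Enactor("B", dns)

# 1. Delay: Enactor A checks its (old) plan early; it still looks fresh.
check_passed = enactor_a.plan_is_fresh(old_plan)

# 2. Progress: while A is stalled, Enactor B applies the newest plan.
enactor_b.apply(new_plan)

# 3. The race: A finally applies its old plan on the strength of the stale
#    check result, overwriting the newer record (check and write are not atomic).
if check_passed:
    enactor_a.apply(old_plan)

# 4. Cleanup: B deletes plans older than its own, including the one A just
#    applied, which removes every IP behind the endpoint.
enactor_b.clean_up(older_than=new_plan.generation)

print(dns.record)  # [] -> the regional endpoint now resolves to nothing
```

The real system had far more safeguards than this sketch, which is precisely the point: the failure required an unusual delay, a specific interleaving, and a cleanup that ran at exactly the wrong moment.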

The result was catastrophic. The deletion of the active plan removed all IP addresses for dynamodb.us-east-1.amazonaws.com, leaving an empty record. The system, now in an inconsistent state, blocked any further updates, requiring manual intervention. A system designed for high availability had, through an unforeseen interaction, orchestrated its own takedown.

For nearly three hours, any service or application attempting to resolve the primary DynamoDB endpoint in the region received nothing. The digital front door to one of AWS's most critical services had vanished.

The Cascade: Congestive Collapse in EC2

While AWS engineers scrambled to manually repair DynamoDB's DNS, the second act of the failure was already beginning. The Elastic Compute Cloud (EC2), the foundational compute service of AWS, was starting to feel the aftershocks. The impact wasn't on running instances, but on the control plane responsible for managing them.

This phase of the outage centered on a subsystem called the DropletWorkflow Manager (DWFM). DWFM manages the physical servers ("droplets") that host EC2 instances. A key function is maintaining a "lease" with each droplet, a periodic check-in to confirm its state. This process depends on DynamoDB.

When DynamoDB went dark, these lease renewals began to fail. Droplets with expired leases are considered ineligible for new instance launches. By the time DynamoDB was restored at 2:25 AM, a significant portion of the EC2 fleet in us-east-1 had lost its lease.

What followed was a state of "congestive collapse." As DWFM came back online, it was flooded with a massive backlog of lease renewals. The system became so overwhelmed trying to process the recovery work that its progress was negligible. Leases were timing out faster than they could be re-established. The system was spinning its wheels, unable to make forward progress. This is a classic failure pattern where a system's throughput plummets under extreme load as it spends more resources on coordination and retries than on productive work.
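
A stripped-down simulation makes the dynamic visible. The numbers below are illustrative assumptions, not DWFM measurements: a fleet of droplets all lose their leases at once, a recovery worker renews a fixed number per tick, and a renewed lease lapses again after a fixed TTL. When the backlog takes longer than one TTL to drain, leases expire faster than they can be re-established and the fleet never recovers, even though the worker is saturated the whole time.

```python
def simulate(n_droplets: int, renewals_per_tick: int, lease_ttl: int, ticks: int) -> int:
    # lease_expires_at[i] is the tick at which droplet i's lease lapses;
    # start with every lease already expired (the post-outage backlog).
    lease_expires_at = [0] * n_droplets
    for now in range(1, ticks + 1):
        expired = [i for i, t in enumerate(lease_expires_at) if t <= now]
        # The recovery worker grinds through the backlog in order, at a fixed rate.
        for i in expired[:renewals_per_tick]:
            lease_expires_at[i] = now + lease_ttl
        healthy = sum(1 for t in lease_expires_at if t > now)
    return healthy  # droplets holding a valid lease at the end of the run


# Backlog drains well within one TTL: the whole fleet recovers.
print(simulate(n_droplets=1_000, renewals_per_tick=200, lease_ttl=10, ticks=100))   # -> 1000

# Backlog takes longer than one TTL: only rate * TTL droplets ever hold a
# valid lease at once, no matter how long the worker keeps churning.
print(simulate(n_droplets=10_000, renewals_per_tick=200, lease_ttl=10, ticks=100))  # -> 2000
```

The point is not the specific numbers but the shape of the failure: past a threshold, adding recovery work faster than it can be retired produces activity without progress.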

Engineers were forced to manually throttle incoming work and perform rolling restarts of DWFM hosts to clear the queues. This painstaking process meant that even though DynamoDB was back, new EC2 instance launches continued to fail for hours with "insufficient capacity" errors—not because there was no physical capacity, but because the control plane couldn't manage it.

The problem was compounded by a second EC2 subsystem, the Network Manager, which is responsible for propagating network configurations to new instances. As DWFM slowly recovered and instances began to launch, the Network Manager was hit with its own backlog of configuration updates, leading to newly launched instances having no network connectivity for an extended period. The cascade was no longer a sequence but a parallel storm of failures.
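
A back-of-the-envelope model shows why this backlog hurt for so long even after processing recovered; the symbols below are illustrative, not figures from the post-mortem. If the Network Manager drains configuration updates at rate $\mu$ and the backlog has grown to $B$ outstanding updates, a newly launched instance joins the back of the queue and waits roughly $$t_{wait} \approx \frac{B}{\mu}$$ before its network state is applied. Until $B$ is worked down, every new launch inherits that delay, which is why instances kept coming up unreachable well after the control plane itself was healthy again.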

The Feedback Loop: A Health Check Paradox

The third major act of this systemic failure demonstrates an even more insidious pattern: a monitoring system actively contributing to instability. The Network Load Balancer (NLB) service, which routes traffic to backend targets like EC2 instances, began to experience widespread connection errors.

The root cause lay in the EC2 network propagation delays. As the NLB service scaled out, provisioning new EC2 instances as NLB nodes to handle load, the Network Manager backlog meant these new instances were coming online without proper network configuration.

This created a destructive feedback loop:

  1. A new NLB node (an EC2 instance) is launched.
  2. The Network Manager is backlogged, so the instance's network configuration is not applied.
  3. The NLB health checker tests the new node. The check fails because the node is unreachable.
  4. The health checker marks the node as unhealthy and removes it from service.
  5. A short time later, the health check runs again. This time it might succeed intermittently, or the network configuration finally propagates. The node is added back into service.
  6. This flapping—rapidly alternating between healthy and unhealthy states—put immense strain on the health checking subsystem itself, causing it to degrade.

This "health check paradox," where the act of observing the system's health negatively impacts it, led to automatic failover mechanisms kicking in. For multi-AZ load balancers, entire AZs were automatically removed from service. If the remaining capacity was insufficient, customer applications experienced connection errors, even if their own backend targets were perfectly healthy. The very system designed to ensure resilience was shedding healthy capacity and causing outages.
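
A toy simulation illustrates how much churn a hair-trigger health checker generates against a target whose network state is still propagating. The reachability model, probabilities, and thresholds below are invented for this post (NLB's real health-check parameters are not public); the comparison simply shows that requiring several consecutive probe results before changing a target's state damps the flapping dramatically.

```python
import random


def reachable(tick: int, propagated_at: int) -> bool:
    # Before its network config lands, the node answers only intermittently.
    if tick >= propagated_at:
        return True
    return random.random() < 0.3


def run_checker(consecutive_needed: int, propagated_at: int = 60, ticks: int = 100) -> int:
    in_service = False
    streak = 0        # consecutive probes disagreeing with the current state
    transitions = 0   # times the node was added to or removed from service
    for tick in range(ticks):
        ok = reachable(tick, propagated_at)
        if ok != in_service:
            streak += 1
            if streak >= consecutive_needed:
                in_service = ok
                transitions += 1
                streak = 0
        else:
            streak = 0
    return transitions


random.seed(7)
print("1-probe checker:", run_checker(consecutive_needed=1), "state changes")
print("5-probe checker:", run_checker(consecutive_needed=5), "state changes")
```

In typical runs, the hair-trigger checker adds and removes the node many times before the configuration ever lands, while the damped checker changes state only when the signal persists; longer registration grace periods for new targets have a similar damping effect, which is where the lessons below pick up.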

The Blast Radius: A Global Dependency Graph Exposed

The combined failures of DynamoDB, EC2, and NLB sent shockwaves across the AWS ecosystem. The list of impacted services reads like a catalog of the cloud itself: Lambda, ECS, EKS, Fargate, S3, STS, and more. Each failure tells a story of hidden dependencies.

  • AWS Lambda: Function invocations failed, event source mappings from SQS were delayed, and internal Lambda systems became under-scaled because they couldn't launch new EC2 instances to replace failed ones.
  • Amazon Redshift: The data warehousing service was hit by a particularly revealing dependency. Not only did clusters in us-east-1 fail, but customers in all other regions who used IAM user credentials for queries were blocked. This was due to a "defect" where Redshift in every region made a call to an IAM API endpoint located specifically in us-east-1. This shatters the common assumption of strong regional isolation for control planes.
  • AWS Support: In a moment of supreme irony, the AWS Support Console itself failed. While it was designed to fail over to another region, a dependent subsystem responsible for account metadata began returning invalid responses instead of no response. The failover logic, not anticipating this state, incorrectly blocked legitimate users from creating support cases during the crisis.

Lessons from the Rubble: Architectural Imperatives

This event is a stark reminder that the cloud is not an infallible utility. It is a distributed system of unprecedented scale and complexity, and complexity is the enemy of reliability. For architects, SREs, and engineers, the lessons are clear and urgent.

  1. Scrutinize Your Dependencies: The primary lesson is that your application's availability is the product, not the average, of its dependencies' availabilities. The formula for system availability, $A_{system}$, where $A_i$ is the availability of the $i$-th serial dependency, is not $A_{system} = \frac{1}{n}\sum_{i=1}^{n} A_i$, but rather: $$A_{system} = \prod_{i=1}^{n} A_i$$ Five hard dependencies at 99.9% each yield only about 99.5% ($0.999^5 \approx 0.995$), and a single critical dependency can bring the entire system down. This incident exposed deep, non-obvious dependencies not only within customer applications but within AWS's own services. You must aggressively map and question every dependency, especially those related to control planes like IAM, STS, and DNS.
  2. Regions Are Not Perfect Silos: The Redshift and AWS Console sign-in failures prove that cross-region dependencies exist, even at the control plane level. A robust multi-region strategy cannot simply assume that a second region will be unaffected by an outage in the first. It requires deep testing of failure modes, including the failure of global or pinned-region services. Ask your provider and yourself: which parts of my stack have a hidden dependency on us-east-1?
  3. Design for Congestive Collapse: The EC2 DWFM failure is a critical lesson in queue management and load handling during recovery. Systems must be designed to gracefully shed load when overwhelmed. Implementing mechanisms like load-adaptive throttling, bounded queues, and circuit breakers on recovery pathways is not optional; it is essential for surviving a "thundering herd" of recovery operations (a minimal circuit-breaker sketch follows this list).
  4. Rethink Health Checks: The NLB paradox shows that health checks can be destabilizing. Health checks should be intelligent. They need to differentiate between a truly failed target and a target that is merely slow to initialize. Implementing longer grace periods for new instances, using multiple modes of health checking, and adding velocity controls to limit how quickly capacity can be removed are crucial defensive measures.
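
To make lesson 3 concrete, here is a minimal circuit breaker of the kind that belongs on recovery pathways. It is a sketch under assumptions, not a production implementation: the thresholds, the timeout, and the idea of wrapping something like a lease-renewal call in it are illustrative choices of this post, not anything from the AWS post-mortem.

```python
import time


class CircuitBreaker:
    """Fail fast once a dependency is clearly unhealthy, then probe it gently."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout_s = reset_timeout_s      # how long to stay open before retrying
        self.consecutive_failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                # Open: shed the work instead of hammering a struggling dependency.
                raise RuntimeError("circuit open: load shed")
            self.opened_at = None                   # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.consecutive_failures = 0
        return result


# Usage sketch: wrap calls to a struggling dependency so that, once it is
# clearly failing, callers fail fast instead of piling on retries.
# breaker = CircuitBreaker()
# breaker.call(renew_lease, droplet_id)   # renew_lease and droplet_id are hypothetical
```

Combined with bounded queues and admission control on the same path, the goal is the same: keep in-flight recovery work small enough that the work which is admitted actually completes.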

The October 2025 outage was a painful but invaluable lesson in the nature of modern infrastructure. It demonstrated that in hyper-scale systems, the most dangerous threats often come not from outside attackers, but from the emergent behavior of the system's own internal logic. The intricate dance of automation and redundancy, designed to provide resilience, contained the seeds of a catastrophic failure. For those of us tasked with building reliable systems on this ever-more-complex foundation, the message is clear: trust, but verify. And always, always plan for the cascade.
