The day a single DNS record took down half the internet
A race condition in DynamoDB's DNS management software produced an empty record set. Dozens of internal AWS services hard-required the record to exist, and the failure cascaded to 140+ dependent services across us-east-1. Recovery took 15 hours because the restart path itself depended on services that could not start.
At 23:48 UTC on the night of October 19, 2025, the DynamoDB service in AWS's us-east-1 region began logging elevated error rates; within minutes its public endpoint was intermittently returning NXDOMAIN. Customer traffic had not yet noticed. Internal monitoring, which resolved the same hostname, had.
Over the next twenty minutes, the failure propagated through an invisible dependency graph. IAM, Lambda, EC2 control plane, EBS — each assumed, reasonably, that DynamoDB would answer. By 00:11 UTC, 140 customer-visible services were degraded or offline. The outage would last fifteen hours.
This is a story about a race condition. It is also a story about what a single string in a config file can do when dozens of downstream systems treat it as an axiom.
Contributing factors
The Swiss Cheese Model asks: which defenses existed, and which had holes? For this incident, the holes lined up in an unusually long chain.
Rare combinatorial state
The race required a specific sequence: a planner instance restarting mid-transition, combined with two enactors observing the pre-restart plan. Chaos testing had covered planner restart, but not with concurrent enactor load. The state combination had occurred twice in the nine-year lifetime of the service — both previous times it produced a non-empty but stale record, which self-corrected within seconds.
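The interleaving can be sketched in a few lines. This is a hypothetical reconstruction: the `Plan` class, the `enact` function, and the cleanup-step mechanics are illustrative assumptions, since the real system's code is not public. The essential point survives: without a generation check, a stale plan with an empty desired set can win the race and delete every record.

```python
# Hypothetical reconstruction of the planner/enactor race.
# All names and mechanics are illustrative, not AWS's actual code.

records = {"dynamodb.us-east-1": ["10.0.0.1", "10.0.0.2"]}

class Plan:
    def __init__(self, generation, desired):
        self.generation = generation
        self.desired = desired  # the record set this plan wants to exist

def enact(plan, records, last_applied):
    """Apply a plan: write its desired records, then clean up records
    the plan does not mention. Note the absent generation check."""
    records.update(plan.desired)
    for host in list(records):
        if host not in plan.desired:
            del records[host]  # cleanup step: trust the plan blindly
    last_applied[0] = max(last_applied[0], plan.generation)

# Planner restarts mid-transition and emits a new plan with new records.
new_plan = Plan(2, {"dynamodb.us-east-1": ["10.0.1.1"]})
# A second enactor still holds the pre-restart plan, whose desired set
# is empty because it captured an intermediate state.
stale_plan = Plan(1, {})

last = [0]
enact(new_plan, records, last)    # correct records written
enact(stale_plan, records, last)  # stale empty plan wins the race
print(records)                    # {} -> NXDOMAIN for the endpoint
```

Running the two enactments in the other order produces a correct record set, which is why the bug could hide for nine years: only this interleaving, under this maintenance state, empties the zone.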
Third-party opacity
Most AWS services depended on the DynamoDB endpoint through a layer of abstraction that did not expose timeout or fallback tuning. Teams that wanted to add retry-and-fail-open behavior were told, correctly, that the endpoint was "highly available" — they took "highly" to mean "always."
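A minimal sketch of the retry-and-fail-open wrapper those teams were unable to add. It assumes a cached last-known-good value is an acceptable degraded answer during an outage; `flaky_lookup`, `CACHED_DEFAULT`, and the backoff constants are hypothetical stand-ins, not anything from the AWS abstraction layer.

```python
import time

# Hypothetical last-known-good value to serve when the dependency is down.
CACHED_DEFAULT = {"feature_flags": {}}

def with_fail_open(call, retries=2, timeout_s=0.5, fallback=CACHED_DEFAULT):
    """Try the dependency a bounded number of times, then serve a
    degraded-but-valid answer instead of propagating the outage."""
    for attempt in range(retries + 1):
        try:
            return call(timeout=timeout_s)
        except Exception:
            if attempt < retries:
                time.sleep(0.1 * 2 ** attempt)  # capped exponential backoff
    return fallback

def flaky_lookup(timeout):
    # Stand-in for a dependency whose endpoint has stopped resolving.
    raise TimeoutError("endpoint returned NXDOMAIN")

result = with_fail_open(flaky_lookup)
print(result)  # degraded answer, not an exception
```

The design choice worth noticing: the fallback is a value, not a retry policy. "Highly available" upstream or not, the caller decides what a survivable answer looks like.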
Monitoring dependency inversion
The dashboards engineers used to diagnose the outage were themselves backed by DynamoDB. During the worst of the incident, the only fully reliable observation tool was a nine-year-old internal IRC bot that had never been updated to use the newer telemetry backend. It became the coordination channel.
Incident response gap
The restart sequence for services whose control planes required DynamoDB was not documented. On-call engineers invented it during the incident.
Blast radius
The blast radius for this incident was infrastructure-level: it affected customers who had never heard of AWS. US airline bookings, New York MTA card readers, European medical imaging portals, two stock exchanges' ancillary systems, and a national weather service saw degraded service. Several affected entities ran no code on AWS; their vendors did.
This pattern — where the party holding the failed dependency is several layers away from the party experiencing the failure — is increasingly common as the supply chain for digital services deepens. See the cross-industry cascading dependency pattern for a survey of comparable incidents in finance, healthcare, and logistics.
Lessons
Graceful degradation has to be a contract, not an option
Every service that hard-required DynamoDB was, in effect, a single point of failure for its own users. A DNS lookup that returns NXDOMAIN should produce a degraded mode, not a crashing one. This has to be enforced at the architecture review layer, not left to implementation judgment.
Observability must be out-of-band
A dashboard that tells you the site is down must not be running on the site. The canonical fix is to run observability from a different region and a different set of backends. The harder fix is organizational — teams paid on service uptime deprioritize building tools for when service uptime fails.
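An out-of-band check can be almost embarrassingly small. A sketch of a dependency-free prober, assuming plain TCP reachability is an acceptable health proxy; in practice it would run from a different region, on infrastructure that shares nothing with the systems it watches, reporting somewhere other than the telemetry backend under test.

```python
import socket
import time

def probe(host, port=443, timeout_s=2.0):
    """Return (ok, latency_seconds). Depends on nothing but a socket --
    no shared telemetry backend, no shared control plane."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start
```

The nine-year-old IRC bot in this incident was effectively this: a tool too simple to share a failure domain with the thing it observed.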
Chaos testing needs to cover state, not just faults
This incident would have been caught by a chaos test that simulated planner restart under concurrent enactor load. The team tested planner restart in isolation. Real-world chaos is a Cartesian product; most chaos testing is a flat list of single faults.
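The difference can be made concrete with `itertools.product`. The fault and load names below are illustrative labels drawn from this incident, not a real test matrix.

```python
from itertools import product

# Single-fault list: roughly what gets tested in practice.
faults = ["planner_restart", "enactor_lag", "dns_write_failure"]
# Concurrent load states: the dimension the tests skipped.
loads = ["idle", "single_enactor", "concurrent_enactors"]

# The real state space is the Cartesian product of the two:
# 3 single faults versus 9 fault-under-load combinations.
scenarios = list(product(faults, loads))
# The fatal pair here was ("planner_restart", "concurrent_enactors").
```

The product grows fast, which is the usual objection; the usual answer is to sample it (or prioritize pairs) rather than collapse it back to a list.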
Timeline
All times UTC except where marked. Sourced from the official post-incident review, customer status pages, and independent observers with archived reachability data.
- Oct 19 · 23:48: First DynamoDB error logs appear in us-east-1. Internal monitoring flags elevated error rates on the control plane. No customer impact yet.
- Oct 19 · 23:59: DNS resolution for the DynamoDB endpoint begins returning NXDOMAIN intermittently. On-call engineers paged.
- Oct 20 · 00:11: Customer-facing impact begins. IAM token issuance fails, cascading to Lambda invocations, EC2 launches, and EBS volume attachments. The AWS status page reflects impact; the status page itself serves slowly because its backend depends on DynamoDB.
- Oct 20 · 00:47: Engineers identify a mismatch between expected and actual DNS records for DynamoDB. The source of truth for the record is the DNS management service, which itself runs in us-east-1 and now cannot write.
- Oct 20 · 02:30: Manual DNS restoration completes. DynamoDB recovers. Lambda, EC2, and EBS remain degraded: their recovery depends on working through backlogs that have grown for two and a half hours.
- Oct 20 · 07:15: Most compute services are back online. Several storage services remain degraded while internal indices rebuild.
- Oct 20 · 15:11: AWS declares all services recovered. Total customer impact: 15h 23m.
Five whys
Why-chains are approximations. The point is not to find the cause but to descend past the first plausible one.
- Why did services fail? DynamoDB returned NXDOMAIN for its endpoint.
- Why? The authoritative DNS record for the endpoint was briefly missing.
- Why? The DNS management service's planner and enactor raced on a plan transition; the enactor applied an empty record set produced by an intermediate state.
- Why? Plan transitions were never designed to be atomic. Locking was considered performance-prohibitive and was not added.
- Why? The design assumed two enactors could never observe the same plan simultaneously — an assumption held for nine years until a rare maintenance state invalidated it.
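The missing defense named in why #4 can be sketched as optimistic concurrency on the plan generation: a generic technique, not AWS's actual fix. A stale enactor's write is rejected rather than applied, and the check is cheap enough to sidestep the locking-is-too-slow objection for all but the compare itself.

```python
import threading

class RecordStore:
    """Illustrative record store with a generation-checked apply."""

    def __init__(self):
        self._lock = threading.Lock()
        self.generation = 0
        self.records = {}

    def apply(self, plan_generation, desired):
        """Apply a plan atomically, and only if it is strictly newer
        than the last plan applied. Returns whether it was applied."""
        with self._lock:
            if plan_generation <= self.generation:
                return False  # stale plan: rejected, not applied
            self.generation = plan_generation
            self.records = dict(desired)
            return True

store = RecordStore()
store.apply(2, {"dynamodb.us-east-1": ["10.0.1.1"]})  # new plan wins
accepted = store.apply(1, {})                         # stale empty plan
print(accepted, store.records)  # the empty record set never lands
```

With this check, the interleaving in this incident degrades from "zone emptied" to "one rejected write in a log", regardless of how many enactors hold stale plans.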
Taxonomy tags
Blast radius: infrastructure-level. Failure modes recorded: total-outage, degraded-performance. For a survey of comparable incidents across industries, see the patterns index.
Sources
- AWS Summary of the DynamoDB DNS Event in the Northern Virginia Region (Oct 2025) — aws.amazon.com/post-mortem/
- Internet Health Report — routing and DNS observations during the incident — ihr.iijlab.net
- ThousandEyes post-incident analysis — thousandeyes.com
- Cloudflare Radar DNS visibility timeline — radar.cloudflare.com
- Independent customer status pages (Netflix, Slack, Snap, Duo) — archive
- Federal Aviation Administration NOTAM 2025-10-20-A (MTA card reader impact) — faa.gov
- Hacker News discussion thread — referenced for engineer-side observation — news.ycombinator.com
Last updated 2025-10-27. Classified against taxonomy v1. Substantive revisions are listed in the changelog.
Related incidents
Twenty-three lines of YAML and a half-billion-dollar morning
A different shape of deploy-time failure: a race condition rather than a flag reuse, but the same out-of-band recovery pattern.
Facebook withdraws itself from the internet
A BGP withdrawal from an audit command removed the DNS servers that served the auth tokens for the tool used to fix it.
Rogers coast-to-coast: when OSPF meets a typo
A maintenance-window policy change withdrew routes nationwide. 911 services went down.