The day a single DNS record took down half the internet
A race condition in DynamoDB's DNS management software produced an empty record set. Dozens of internal AWS services hard-required the record to exist, and the failure cascaded to 140+ dependent services across us-east-1. Recovery took 15 hours because the restart path itself depended on services that could not start.
At 23:48 UTC on the night of October 19, 2025, the DynamoDB service in AWS's us-east-1 region began logging elevated error rates; within minutes its public endpoint was intermittently returning NXDOMAIN. Customer traffic had not yet noticed. Internal monitoring, which resolved the same hostname, had.
Over the next twenty minutes, the failure propagated through an invisible dependency graph. IAM, Lambda, EC2 control plane, EBS — each assumed, reasonably, that DynamoDB would answer. By 00:11 UTC, 140 customer-visible services were degraded or offline. The outage would last fifteen hours.
This is a story about a race condition. It is also a story about what a single string in a config file can do when dozens of downstream systems treat it as an axiom.
Contributing factors
The Swiss Cheese Model asks: which defenses existed, and which had holes? For this incident, the holes lined up in an unusually long chain.
Rare combinatorial state
The race required a specific sequence: a planner instance restarting mid-transition, combined with two enactors observing the pre-restart plan. Chaos testing had covered planner restart, but not with concurrent enactor load. The state combination had occurred twice in the nine-year lifetime of the service — both previous times it produced a non-empty but stale record, which self-corrected within seconds.
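The interleaving can be sketched in a few lines. This is a hypothetical reconstruction: the `Plan` class, the `enact` function, and the cleanup-step mechanics are illustrative assumptions, since the real system's code is not public. The essential point survives: without a generation check, a stale plan with an empty desired set can win the race and delete every record.

```python
# Hypothetical reconstruction of the planner/enactor race.
# All names and mechanics are illustrative, not AWS's actual code.

records = {"dynamodb.us-east-1": ["10.0.0.1", "10.0.0.2"]}

class Plan:
    def __init__(self, generation, desired):
        self.generation = generation
        self.desired = desired  # the record set this plan wants to exist

def enact(plan, records, last_applied):
    """Apply a plan: write its desired records, then clean up records
    the plan does not mention. Note the absent generation check."""
    records.update(plan.desired)
    for host in list(records):
        if host not in plan.desired:
            del records[host]  # cleanup step: trust the plan blindly
    last_applied[0] = max(last_applied[0], plan.generation)

# Planner restarts mid-transition and emits a new plan with new records.
new_plan = Plan(2, {"dynamodb.us-east-1": ["10.0.1.1"]})
# A second enactor still holds the pre-restart plan, whose desired set
# is empty because it captured an intermediate state.
stale_plan = Plan(1, {})

last = [0]
enact(new_plan, records, last)    # correct records written
enact(stale_plan, records, last)  # stale empty plan wins the race
print(records)                    # {} -> NXDOMAIN for the endpoint
```

Running the two enactments in the other order produces a correct record set, which is why the bug could hide for nine years: only this interleaving, under this maintenance state, empties the zone.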
Third-party opacity
Most AWS services depended on the DynamoDB endpoint through a layer of abstraction that did not expose timeout or fallback tuning. Teams that wanted to add retry-and-fail-open behavior were told, correctly, that the endpoint was "highly available" — they took "highly" to mean "always."
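A minimal sketch of the retry-and-fail-open wrapper those teams were unable to add. It assumes a cached last-known-good value is an acceptable degraded answer during an outage; `flaky_lookup`, `CACHED_DEFAULT`, and the backoff constants are hypothetical stand-ins, not anything from the AWS abstraction layer.

```python
import time

# Hypothetical last-known-good value to serve when the dependency is down.
CACHED_DEFAULT = {"feature_flags": {}}

def with_fail_open(call, retries=2, timeout_s=0.5, fallback=CACHED_DEFAULT):
    """Try the dependency a bounded number of times, then serve a
    degraded-but-valid answer instead of propagating the outage."""
    for attempt in range(retries + 1):
        try:
            return call(timeout=timeout_s)
        except Exception:
            if attempt < retries:
                time.sleep(0.1 * 2 ** attempt)  # capped exponential backoff
    return fallback

def flaky_lookup(timeout):
    # Stand-in for a dependency whose endpoint has stopped resolving.
    raise TimeoutError("endpoint returned NXDOMAIN")

result = with_fail_open(flaky_lookup)
print(result)  # degraded answer, not an exception
```

The design choice worth noticing: the fallback is a value, not a retry policy. "Highly available" upstream or not, the caller decides what a survivable answer looks like.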
Monitoring dependency inversion
The dashboards engineers used to diagnose the outage were themselves backed by DynamoDB. During the worst of the incident, the only fully reliable observation tool was a nine-year-old internal IRC bot that had never been updated to use the newer telemetry backend. It became the coordination channel.
Incident response gap
The restart sequence for services whose control planes required DynamoDB was not documented. On-call engineers invented it during the incident.
Blast radius
The blast radius for this incident was infrastructure-level: it affected customers who had never heard of AWS. US airline bookings, New York MTA card readers, European medical imaging portals, two stock exchanges' ancillary systems, and a national weather service saw degraded service. Several affected entities ran no code on AWS; their vendors did.
This pattern — where the party holding the failed dependency is several layers away from the party experiencing the failure — is increasingly common as the supply chain for digital services deepens. See the cross-industry cascading dependency pattern for a survey of comparable incidents in finance, healthcare, and logistics.
Lessons
Graceful degradation has to be a contract, not an option
Every service that hard-required DynamoDB was, in effect, a single point of failure for its own users. A DNS lookup that returns NXDOMAIN should produce a degraded mode, not a crashing one. This has to be enforced at the architecture review layer, not left to implementation judgment.
Observability must be out-of-band
A dashboard that tells you the site is down must not be running on the site. The canonical fix is to run observability from a different region and a different set of backends. The harder fix is organizational — teams paid on service uptime deprioritize building tools for when service uptime fails.
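An out-of-band check can be almost embarrassingly small. A sketch of a dependency-free prober, assuming plain TCP reachability is an acceptable health proxy; in practice it would run from a different region, on infrastructure that shares nothing with the systems it watches, reporting somewhere other than the telemetry backend under test.

```python
import socket
import time

def probe(host, port=443, timeout_s=2.0):
    """Return (ok, latency_seconds). Depends on nothing but a socket --
    no shared telemetry backend, no shared control plane."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start
    except OSError:
        return False, time.monotonic() - start
```

The nine-year-old IRC bot in this incident was effectively this: a tool too simple to share a failure domain with the thing it observed.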
Chaos testing needs to cover state, not just faults
This incident would have been caught by a chaos test that simulated planner restart under concurrent enactor load. The team tested planner restart in isolation. Real-world chaos is a Cartesian product; most chaos testing is a flat list of single faults.
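The difference can be made concrete with `itertools.product`. The fault and load names below are illustrative labels drawn from this incident, not a real test matrix.

```python
from itertools import product

# Single-fault list: roughly what gets tested in practice.
faults = ["planner_restart", "enactor_lag", "dns_write_failure"]
# Concurrent load states: the dimension the tests skipped.
loads = ["idle", "single_enactor", "concurrent_enactors"]

# The real state space is the Cartesian product of the two:
# 3 single faults versus 9 fault-under-load combinations.
scenarios = list(product(faults, loads))
# The fatal pair here was ("planner_restart", "concurrent_enactors").
```

The product grows fast, which is the usual objection; the usual answer is to sample it (or prioritize pairs) rather than collapse it back to a list.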
Timeline
All times UTC except where marked. Sourced from the official post-incident review, customer status pages, and independent observers with archived reachability data.
- Oct 19 · 23:48: First DynamoDB error logs appear in us-east-1. Internal monitoring flags elevated error rates on the control plane. No customer impact yet.
- Oct 19 · 23:59: DNS resolution for the DynamoDB endpoint begins returning NXDOMAIN intermittently. On-call engineers paged.
- Oct 20 · 00:11: Customer-facing impact begins. IAM token issuance fails, cascading to Lambda invocations, EC2 launches, and EBS volume attachments. The AWS status page reflects impact; the status page itself serves slowly because its backend depends on DynamoDB.
- Oct 20 · 00:47: Engineers identify a mismatch between expected and actual DNS records for DynamoDB. The source of truth for the record is the DNS management service, which itself runs in us-east-1 and now cannot write.
- Oct 20 · 02:30: Manual DNS restoration completes. DynamoDB recovers. Lambda, EC2, and EBS remain degraded: their recovery depends on working through backlogs that have grown for two and a half hours.
- Oct 20 · 07:15: Most compute services are back online. Several storage services remain degraded while internal indices rebuild.
- Oct 20 · 15:11: AWS declares all services recovered. Total customer impact: 15h 23m.
Five whys
Why-chains are approximations. The point is not to find the cause but to descend past the first plausible one.
- Why did services fail? DynamoDB returned NXDOMAIN for its endpoint.
- Why? The authoritative DNS record for the endpoint was briefly missing.
- Why? The DNS management service's planner and enactor raced on a plan transition; the enactor applied an empty record set produced by an intermediate state.
- Why? Plan transitions were never designed to be atomic. Locking was considered performance-prohibitive and was not added.
- Why? The design assumed two enactors could never observe the same plan simultaneously — an assumption held for nine years until a rare maintenance state invalidated it.
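The missing defense named in why #4 can be sketched as optimistic concurrency on the plan generation: a generic technique, not AWS's actual fix. A stale enactor's write is rejected rather than applied, and the check is cheap enough to sidestep the locking-is-too-slow objection for all but the compare itself.

```python
import threading

class RecordStore:
    """Illustrative record store with a generation-checked apply."""

    def __init__(self):
        self._lock = threading.Lock()
        self.generation = 0
        self.records = {}

    def apply(self, plan_generation, desired):
        """Apply a plan atomically, and only if it is strictly newer
        than the last plan applied. Returns whether it was applied."""
        with self._lock:
            if plan_generation <= self.generation:
                return False  # stale plan: rejected, not applied
            self.generation = plan_generation
            self.records = dict(desired)
            return True

store = RecordStore()
store.apply(2, {"dynamodb.us-east-1": ["10.0.1.1"]})  # new plan wins
accepted = store.apply(1, {})                         # stale empty plan
print(accepted, store.records)  # the empty record set never lands
```

With this check, the interleaving in this incident degrades from "zone emptied" to "one rejected write in a log", regardless of how many enactors hold stale plans.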
Taxonomy tags
Blast radius: infrastructure-level. Failure modes recorded: total-outage, degraded-performance. For a survey of comparable incidents across industries, see the patterns index.
Sources
- AWS Summary of the DynamoDB DNS Event in the Northern Virginia Region (Oct 2025) — aws.amazon.com/post-mortem/
- Internet Health Report — routing and DNS observations during the incident — ihr.iijlab.net
- ThousandEyes post-incident analysis — thousandeyes.com
- Cloudflare Radar DNS visibility timeline — radar.cloudflare.com
- Independent customer status pages (Netflix, Slack, Snap, Duo) — archive
- Federal Aviation Administration NOTAM 2025-10-20-A (MTA card reader impact) — faa.gov
- Hacker News discussion thread — referenced for engineer-side observation — news.ycombinator.com
Last updated 2025-10-27. Classified against taxonomy v1. Substantive revisions are listed in the changelog.
Related incidents
Twenty-three lines of YAML and a half-billion-dollar morning
A different shape of deploy-time failure: a race condition rather than a flag reuse, but the same out-of-band recovery pattern.
Facebook withdraws itself from the internet
A BGP withdrawal from an audit command removed the DNS servers that served the auth tokens for the tool used to fix it.
Rogers coast-to-coast: when OSPF meets a typo
A maintenance-window policy change withdrew routes nationwide. 911 services went down.