The Great AWS Outage: What Really Happened on October 20, 2025?

AWSNEWS

Tharana

10/24/2025 · 3 min read

In October 2025, a major AWS outage shook the internet and chances are, you felt its effects even if you didn’t know what was going on. Websites were down, apps stopped working, and even big businesses like Amazon, Twitter, and major banks all experienced hiccups. But what actually happened behind the scenes, and why did it take down so much of the web? Let’s break it down in the simplest way possible.

What is AWS and Why Do We Rely on It?

Amazon Web Services (AWS) is like the power grid of the internet. Companies use AWS for everything from storage and databases to running their websites and critical apps. When AWS sneezes, a big part of the internet catches a cold.

The Phonebook of the Internet: DNS in 2 Minutes

Every time you visit a website or use an app, your device needs to know where to send the request. That’s where DNS (Domain Name System) comes in. Think of DNS as the phonebook of the internet. When you type “amazon.com,” DNS tells your computer the number (IP address) to call so you reach the right place.
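
If you want to see this phonebook lookup for yourself, here is a tiny Python sketch using only the standard socket library (amazon.com is just the example domain from above):

```python
# Ask DNS for the IP address behind a hostname: the internet "phonebook" in action.
import socket

hostname = "amazon.com"
try:
    ip_address = socket.gethostbyname(hostname)  # the DNS query
    print(f"{hostname} lives at {ip_address}")
except socket.gaierror:
    # Roughly what apps experienced during the outage: the name simply
    # could not be resolved to any address.
    print(f"No DNS answer for {hostname}: 'this number is not in service'")
```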

What Went Wrong: A Simple Story

Step 1: Meet DynamoDB

DynamoDB is an important database service inside AWS. It stores and fetches data super-fast for tons of websites and apps. To work efficiently, DynamoDB uses a special system to update its own “phonebook entries” (DNS records) so all the traffic goes to the right servers.
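
To give a feel for what “using DynamoDB” looks like from an app’s point of view, here is a minimal sketch with boto3, the AWS SDK for Python. The table name “Users” and the key are invented for illustration, and you would need your own AWS credentials and table for it to run:

```python
# Minimal DynamoDB read using boto3 (illustrative only; "Users" is a made-up table).
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Before this request can even be sent, boto3 has to resolve DynamoDB's DNS
# name (dynamodb.us-east-1.amazonaws.com) to an IP address: the exact step
# that failed during the outage.
response = dynamodb.get_item(
    TableName="Users",
    Key={"user_id": {"S": "123"}},
)
print(response.get("Item"))
```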

Step 2: Two Robots and a Race

AWS set up two “robots” (actually, automated programs called DNS Enactors) in different places to keep DynamoDB’s phonebook entries up-to-date. The idea was to make the system really reliable.

  • Robot #1 started to do an update but got super slow, like when your internet lags badly.

  • Robot #2 started after, finished its own update very quickly, and then thought: “Hey, Robot #1 is working on an old plan. I should clean up these old records.”

  • But Robot #1 was still using the records that Robot #2 called “old.”

Step 3: The Catastrophic Mistake

Robot #2, thinking it was helping, removed all the phone numbers (IP addresses) for DynamoDB from AWS’s official DNS. Suddenly, when anyone tried to find DynamoDB, they got no answer—like calling a business and getting “this number is not in service.”
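
To make that timing problem concrete, here is a toy re-enactment in Python. This is not AWS’s real code: two threads stand in for the two robots, and an ordinary dictionary plays the role of the DNS phonebook.

```python
# Toy version of the race: a slow updater, a fast updater, and an overeager cleanup.
import threading
import time

dns_table = {"dynamodb.example": ["10.0.0.1", "10.0.0.2"]}  # the current "phonebook"
lock = threading.Lock()

def enactor(name, plan_version, addresses, delay):
    """Apply a DNS 'plan' after a delay (the delay simulates Robot #1's lag)."""
    time.sleep(delay)
    with lock:
        dns_table["dynamodb.example"] = addresses
        print(f"{name} applied plan v{plan_version}: {addresses}")

def cleanup(latest_version, applied_version):
    """Robot #2's cleanup: delete records that belong to an 'old' plan."""
    with lock:
        if applied_version < latest_version:
            # It thinks these records are stale, but live traffic still needs them.
            dns_table.pop("dynamodb.example", None)
            print("Cleanup removed the records entirely!")

slow = threading.Thread(target=enactor, args=("Robot #1", 1, ["10.0.0.3"], 0.2))
fast = threading.Thread(target=enactor, args=("Robot #2", 2, ["10.0.0.4"], 0.0))
slow.start(); fast.start()
slow.join(); fast.join()

cleanup(latest_version=2, applied_version=1)  # tidy up Robot #1's "old" plan v1
print("Final phonebook:", dns_table)          # {} -> nobody can find DynamoDB
```

On this toy run, the fast robot’s newer plan gets overwritten by the slow robot’s older one, and the cleanup step then deletes the entry outright, leaving an empty phonebook.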

Why Did So Many Services Break?

You might wonder: “If the problem was just with DynamoDB’s phonebook entry, why did so much of AWS break?”

  • Many core AWS services rely on DynamoDB behind the scenes, like EC2 (virtual servers), Lambda (serverless functions), and even AWS’s own monitoring and support systems!

  • If DynamoDB is unreachable, these services can’t perform key tasks because they can’t find the right data or coordinate actions.

  • It’s like the main power line for a whole city going down: lots of everyday systems stop working, not just the light in one house. (A small code sketch of this chain reaction follows right after this list.)
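
Here is that sketch: a toy dependency chain in Python (again, not real AWS code) showing how one missing DNS entry at the bottom knocks out the services stacked on top of it.

```python
# Toy dependency chain: the internals are heavily simplified, but the shape
# of the failure is the point.
def resolve_dynamodb():
    # During the outage, this lookup returned no records at all.
    raise LookupError("no DNS records for dynamodb.us-east-1.amazonaws.com")

def launch_ec2_instance():
    resolve_dynamodb()          # EC2's internal bookkeeping needs DynamoDB
    return "instance launched"

def run_lambda_function():
    resolve_dynamodb()          # so does Lambda's
    return "function executed"

for service, call in [("EC2", launch_ec2_instance), ("Lambda", run_lambda_function)]:
    try:
        call()
    except LookupError as err:
        print(f"{service} request failed: {err}")
```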

The Domino Effect and Retry Storm

When something breaks, computers usually try again automatically. When millions of apps and systems all retry at once, that flood of repeat requests is called a retry storm (sketched in code right after this list). During the outage:

  • Apps, users, and AWS systems all kept trying to contact DynamoDB, repeating their failures over and over.

  • This flood of requests made things even worse, slowing down recovery and spreading the pain to even more services.
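
Here is the sketch mentioned above: the “retry instantly” pattern that fuels a retry storm, next to the gentler exponential-backoff-with-jitter approach that well-behaved clients use. flaky_call() is just a stand-in for any request to an unreachable service.

```python
import random
import time

def flaky_call():
    # Stand-in for a request to a service whose DNS entry has vanished.
    raise ConnectionError("DynamoDB endpoint could not be resolved")

def naive_retry(attempts=3):
    """Retry instantly; multiplied across millions of clients, this is a retry storm."""
    for _ in range(attempts):
        try:
            return flaky_call()
        except ConnectionError:
            pass  # no pause at all before the next attempt

def retry_with_backoff(attempts=3):
    """Wait longer after each failure, plus random 'jitter' so clients
    don't all retry at the same instant."""
    for attempt in range(attempts):
        try:
            return flaky_call()
        except ConnectionError:
            time.sleep(min(4, 2 ** attempt) + random.random())

naive_retry()          # hammers the service three times in a few milliseconds
retry_with_backoff()   # spreads the same three attempts over several seconds
```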

What Is a Race Condition? (And Why Should You Care?)

A “race condition” is a programming problem that happens when two or more pieces of software try to access or change the same thing at the same time and the end result depends on the order and timing. It’s a bit like two people trying to edit the same document at once, and whoever saves last determines the final version.

Race conditions are not unique to AWS; they can happen in any computer system! But when it hits the backbone of the internet, the effects are much bigger.
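
The document-editing analogy translates almost directly into code. In this small, purely illustrative Python sketch, two threads read the same shared “document”, each makes an edit, and whoever saves last silently wipes out the other’s work:

```python
# Two "editors" (threads) read, edit, and save the same shared document.
import threading
import time

document = "Original draft"

def editor(name, edit, think_time):
    global document
    my_copy = document              # read the shared document
    time.sleep(think_time)          # spend some time "editing"
    document = my_copy + edit       # save: overwrites anything saved in between
    print(f"{name} saved their version")

alice = threading.Thread(target=editor, args=("Alice", " + Alice's edit", 0.1))
bob = threading.Thread(target=editor, args=("Bob", " + Bob's edit", 0.2))
alice.start(); bob.start()
alice.join(); bob.join()

print("Final document:", document)  # only Bob's edit survives; Alice's is lost
```

Run it and only Bob’s edit survives even though Alice saved first: the same “last write wins, earlier work lost” shape as the DNS records being cleaned away.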

How Did AWS Fix It?

  • Manual Intervention: AWS engineers had to jump in, update records manually, and reset certain systems.

  • System Changes: Automated DNS updating for DynamoDB was turned off globally until the bug could be fixed.

  • Lesson Learned: AWS and everyone else learned (again) how important it is to plan for rare, unexpected interactions between automation tools; a tiny bug can have massive consequences!

Final Thoughts

The October 2025 AWS outage was a perfect storm of automation gone wrong. It reminds us that even the biggest, most reliable services on the internet are still run by code, and that code can fail in unexpected ways. By understanding the basics, we help everyone (from tech newbies to cloud architects) appreciate why strong, thoughtful design and regular checks are so important for a stable, resilient internet.

Have more questions about cloud outages, DNS, or how the internet ticks? Leave a comment below!