Amazon’s Typo Crash
On February 28th a large part of the internet came to a grinding halt. I, for one, found myself bumbling through the usual routine of trying to figure out what was causing my own internet issues. That process led me to discover that the websites I use to check whether other websites are up and running were themselves down. That was the moment I realized something was amiss. With a little research I found that Amazon's S3 service was down. As the full extent of the issue dawned on me, I knew this was going to be a big deal.
The immediate problem is obvious: websites going down can cost companies a lot of money. But the bigger issue was that this was the first time Amazon had dropped below 99% uptime, which means Amazon has to pay any company using S3 a 25% Service Credit.
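To put the 99% figure in perspective, here is a minimal sketch of the arithmetic. It assumes a simplified model in which uptime is just the share of hours in a 30-day month that the service was available; the real agreement may measure this differently, and the function names are made up for illustration.

```python
def allowed_downtime_hours(uptime_percent: float, days_in_month: int = 30) -> float:
    """Hours of downtime a month can absorb before falling below the threshold."""
    total_hours = days_in_month * 24
    return total_hours * (100.0 - uptime_percent) / 100.0


def monthly_uptime_percent(downtime_hours: float, days_in_month: int = 30) -> float:
    """Uptime for the month, given total hours of downtime (simplified model)."""
    total_hours = days_in_month * 24
    return 100.0 * (total_hours - downtime_hours) / total_hours


if __name__ == "__main__":
    # Under this simplified model, a 99% threshold leaves about 7.2 hours of slack.
    print(allowed_downtime_hours(99.0))
```

Under these assumptions a month has 720 hours, so the margin between "fine" and "below 99%" is only a handful of hours.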
So how did this all happen? It turns out an S3 team member ran a scripted command that accidentally had a small typo in it. The script was intended to take a few servers offline, but instead it took a large number of servers offline. Those servers helped support two other S3 subsystems. Losing that much capacity forced both subsystems into a full restart, and the restart took far longer than expected. It was a slow train wreck that was eventually resolved, but it did reveal the glaring risks of running such a large set of interconnected systems. Not that Amazon did anything unusual, but it was a reminder to every team to always double-check what you type and to regularly test how long-running systems recover.
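One way to blunt this kind of typo is to have the removal script refuse requests that are too large. The sketch below is purely illustrative and is not Amazon's tooling: the function name, the exception, and the 5% cap are all assumptions chosen for the example.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request looks unsafe (hypothetical guard)."""


def plan_removal(active_servers, requested, max_percent=5, min_remaining=1):
    """Return the servers to remove, or raise if the request is unsafe.

    max_percent and min_remaining are assumed limits, not real AWS settings.
    """
    requested = list(requested)

    # Reject names that are not in the active fleet (a likely sign of a typo).
    unknown = [s for s in requested if s not in active_servers]
    if unknown:
        raise CapacityGuardError(f"unknown servers: {unknown}")

    # Cap how much of the fleet a single command may take offline.
    limit = len(active_servers) * max_percent // 100
    if len(requested) > limit:
        raise CapacityGuardError(
            f"refusing to remove {len(requested)} servers; limit is {limit}"
        )

    # Never drop below the minimum capacity the system needs.
    if len(active_servers) - len(requested) < min_remaining:
        raise CapacityGuardError("removal would leave too few servers")

    return requested


if __name__ == "__main__":
    fleet = [f"srv-{i}" for i in range(100)]
    print(plan_removal(fleet, fleet[:3]))  # small request passes the checks
```

With a check like this, a mistyped argument that selects half the fleet fails loudly instead of quietly taking those servers offline.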