On October 4th 2021, for approximately six hours, Facebook and most of their other web services became unavailable, as millions suddenly became unable to connect with each other online.
It’s far from the first time Facebook has experienced downtime, but this was one of its more serious outages. It was global, it lasted a longer-than-usual six hours, it affected both the apps and the websites, and it took down all four of the company’s most popular products: Messenger, Instagram, WhatsApp and Facebook itself.
Users who tried to visit the websites were given DNS-related error messages, meaning this wasn’t simply a coding error in a website itself. As far as the Internet was concerned, the websites simply didn’t exist.
So what happened?
Facebook has offered a rough outline of an explanation, claiming that configuration changes to “backbone routers” had caused issues with data communication between its data centres, and that this had a cascading effect that brought all of its services to a sudden halt.
That’s the typical, vague and mostly unhelpful response companies tend to give when things go wrong, and is tantamount to “well, we changed something and messed it all up”.
However, what Internet experts have been able to determine is that, through some configuration mishap (the specifics of which we may never fully know), Facebook effectively erased its own BGP routing information from the Internet.
BGP stands for Border Gateway Protocol, and is a standardised set of rules governing how traffic is routed across the Internet. The simplest analogy is to think of BGP like a postal service, which determines the best route for mail to take to reach its destination.
The Internet is made up of thousands of networks, often called autonomous systems. When someone in the UK wants to visit a website in the USA, for example, their request to load that website needs to be directed across this vast landscape of networks, and then the data that makes up the website needs to be directed back to the user’s computer. It is BGP that controls this routing, picking what it judges to be the best available route across those networks (usually one that crosses fewer of them), so the website loads as quickly as possible.
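To make the route-picking idea concrete, here is a minimal sketch in Python. It is a toy, not how real routers are written: it assumes the only rule is "prefer the route that crosses the fewest networks", whereas real BGP also weighs each operator's own policies and several other tie-breakers. The `Route` class, the `best_route` function, the network numbers and the address block are all made up for illustration (they come from ranges reserved for documentation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str     # the block of addresses this route leads to
    as_path: tuple  # the networks (autonomous systems) the traffic would cross


def best_route(routes):
    """Return the route crossing the fewest networks, or None if there are none.

    Assumption: shortest path wins. Real BGP applies operator policy first.
    """
    if not routes:
        return None
    return min(routes, key=lambda r: len(r.as_path))


# Two hypothetical ways of reaching the same destination.
candidates = [
    Route("203.0.113.0/24", (64500, 64501, 64510)),  # crosses three networks
    Route("203.0.113.0/24", (64502, 64510)),         # crosses two networks
]

chosen = best_route(candidates)
print(chosen.as_path)  # the two-network path: (64502, 64510)
```

In postal terms, this is the sorting office choosing the delivery route with the fewest handovers.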
Now, if you operate your own autonomous system, as many large high-tech companies like Facebook do, you are responsible for making available your own BGP routing information (your own address, plus instructions for finding it). This information is then shared across the Internet so all relevant routers and devices know where to direct your traffic.
However, if you delete your BGP routing information, as Facebook appear to have done, the Internet no longer knows where to direct your traffic. You’ve effectively deleted your address from the very entities that need to know it.
So, when someone tries to access the Facebook website, the Internet simply says “I don’t know where that website is, or even if it exists at all.”
And the user will get an error.
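The effect of deleting that routing information can be sketched with another toy Python example. It assumes a single, simplified routing table that networks can add themselves to ("announce") or remove themselves from ("withdraw"). The `RoutingTable` class, the network number 64496 and the address block 192.0.2.0/24 are hypothetical stand-ins, not Facebook's real values, and nothing here reflects the actual configuration change, which has not been published in detail.

```python
import ipaddress


class RoutingTable:
    """A toy picture of what the rest of the Internet knows about where to send traffic."""

    def __init__(self):
        self._routes = {}  # address block -> number of the network that announced it

    def announce(self, prefix, origin_as):
        """A network tells everyone: 'traffic for this block comes to me'."""
        self._routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        """The announcement is removed; the block no longer has a known home."""
        self._routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the network responsible for an address, or None if nobody claims it."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._routes if addr in net]
        if not matches:
            return None
        # When several blocks match, the most specific one wins.
        best = max(matches, key=lambda net: net.prefixlen)
        return self._routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", 64496)  # hypothetical block and network number

before = table.lookup("192.0.2.53")    # 64496: the Internet knows where to go
table.withdraw("192.0.2.0/24")         # the routing information is deleted
after = table.lookup("192.0.2.53")     # None: nobody knows where this address lives

print(before, after)
```

Once the lookup comes back empty, it makes no difference that the servers themselves are still switched on and working: nothing can find its way to them.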
To go back to the postal service analogy, as a website visitor, typing Facebook.com into your browser is akin to trying to mail envelopes with random squiggles in place of an address. The postal service will simply not know where to deliver your mail.
And fixing the mishap wasn’t as simple as reversing whatever update caused the error. Since most of Facebook’s servers, tools and products went down, so did the company’s ability to reverse the goof remotely, meaning Facebook employees needed to gain physical access to the data centres to begin fixing the issue. That was reportedly made harder because employees were physically locked out by misfiring card readers that relied on (you guessed it) Facebook’s own systems to operate correctly.
We don’t expect it’s a mistake that Mark Zuckerberg or Facebook will be repeating any time soon.