Big Outage Day October 4, 2021

Monday, October 4, 2021 was an Internet outage day. Many Facebook users faced an outage and felt they had “internjet” instead of Internet. ePanorama.net also had an outage of a few hours.

ePanorama.net went down a few hours earlier. Within a few hours of my getting the server back up, Facebook was able to get their site up as well. An interesting correlation, but correlation does not always indicate causality. The causes were different, and so was the scale of the incidents. At this site it was just a simple local server error (out of disk space on a critical part) that took some time to get access to and fix. At Facebook the situation was quite different: they had a configuration accident that cut all of their servers off from the Internet.

The crash of Facebook’s services for six hours showed how vulnerable we are. In the future, the internet will be as important as the electricity grid, and that scares even Mikko Hyppönen, F-Secure’s research director. According to the news agency Reuters, problems also occurred on Twitter, Google and Amazon at the same time. F-Secure’s Mikko Hyppönen: the concentration of online services in Silicon Valley is a weak point of the Internet, and the Facebook crash was “absolutely exceptional”.

The Cloudflare article Understanding How Facebook Disappeared from the Internet https://blog.cloudflare.com/october-2021-facebook-outage/ gives a good explanation of what happened to Facebook:

“The Internet is literally a network of networks, and it’s bound together by BGP. BGP allows one network (say Facebook) to advertise its presence to other networks that form the Internet. As we write Facebook is not advertising its presence, ISPs and other networks can’t find Facebook’s network and so it is unavailable. With those withdrawals, Facebook and its sites had effectively disconnected themselves from the Internet. As a direct consequence of this, DNS resolvers all over the world stopped resolving their domain names.”
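The advertise/withdraw mechanism in the quote can be illustrated with a toy model. This is only a sketch of the idea, not of real BGP: the class name `ToyRoutingTable`, the documentation prefixes (192.0.2.0/24, 198.51.100.0/24) and the private-use AS numbers are all made up for illustration and have nothing to do with Facebook's actual address space.

```python
import ipaddress


class ToyRoutingTable:
    """Toy view of the prefixes one network has learned from its neighbours."""

    def __init__(self):
        self.routes = {}  # ip_network -> origin AS number (hypothetical values)

    def advertise(self, prefix, origin_as):
        """Record that origin_as announces this prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        """Forget the prefix, as happens when the announcement is withdrawn."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin AS of the longest matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination is simply unknown
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = ToyRoutingTable()
table.advertise("192.0.2.0/24", 64500)     # hypothetical "big site" prefix
table.advertise("198.51.100.0/24", 64501)  # some unrelated network

before = table.lookup("192.0.2.53")  # route known while the prefix is advertised
table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.53")   # route gone after the withdrawal
print(before, after)
```

Once the prefix is withdrawn, every address inside it becomes unfindable, including any DNS servers hosted there, while unrelated networks are unaffected.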

You can find the official Facebook explanation at https://engineering.fb.com/2021/10/04/networking-traffic/outage/

“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”

Facebook Outage: Yes, its DNS (sort of). A super quick analysis of what is going on
https://isc.sans.edu/forums/diary/Facebook+Outage+Yes+its+DNS+sort+of+A+super+quick+analysis+of+what+is+going+on/27900/

“More readable summary of the analysis below: The BGP routes pointing traffic to Facebook’s IP address space have been withdrawn. The Internet no longer knows where to find Facebook’s IPs. One symptom is that DNS requests are failing. But this is just the result of Facebook hosting its DNS servers inside its own network. Even with working DNS (for example if you still have cached results), the IPs are currently not reachable.”
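The distinction made in that summary (DNS failing versus the addresses themselves being unreachable) can be checked from a client with a small script. The following is a rough sketch under my own assumptions: the function names, the result strings and the `cached_ips` parameter are invented for this example, and a TCP connect on port 443 is only one crude way to test reachability.

```python
import socket


def _resolve(hostname):
    """Resolve a hostname to a list of IP address strings via the system resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def _connect(ip, port, timeout):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(hostname, port=443, cached_ips=(), resolve=_resolve,
             connect=_connect, timeout=3.0):
    """Classify a failure as a DNS problem, a reachability problem, or both."""
    try:
        ips = resolve(hostname)
        dns_ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ips = list(cached_ips)  # fall back to addresses remembered earlier
        dns_ok = False

    reachable = [ip for ip in ips if connect(ip, port, timeout)]

    if dns_ok and reachable:
        return "ok"
    if dns_ok:
        return "dns ok, hosts unreachable"
    if reachable:
        return "dns failing, cached address reachable"
    if ips:
        return "dns failing, cached addresses unreachable"
    return "dns failing, no cached address to test"
```

Calling `diagnose("www.example.com")` uses the real resolver and network. During an outage like the one described above, one would expect the "dns failing, cached addresses unreachable" result when a previously cached address is passed in `cached_ips`.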

More information links:

Facebook, Instagram, WhatsApp, and Oculus are down. Here’s what we know [Updated]
The root cause of the worldwide outage appears to be a flubbed BGP route update.
https://arstechnica.com/information-technology/2021/10/facebook-instagram-whatsapp-and-oculus-are-down-heres-what-we-know/

Zuckerberg Loses $5.9 Billion In A Day As Facebook Faces Rare Outage, Whistleblower Testimony
https://www.forbes.com/sites/abrambrown/2021/10/04/zuckerberg-net-worth-billionaire-facebook-stock-outage/

Facebook, WhatsApp and Instagram are slowly returning. Why did they disappear to begin with?
It’s always DNS, except when it’s BGP
https://techcrunch.com/2021/10/04/facebook-whatsapp-instagram-return/

These Three Letter Tech Acronyms Are Likely Behind Today’s Mega Outage At Facebook
https://www.forbes.com/sites/martingiles/2021/10/04/dns-and-bgp-these-acronyms-are-behind-facebooks-mega-outage/

https://krebsonsecurity.com/2021/10/what-happened-to-facebook-instagram-whatsapp/

https://yle.fi/uutiset/3-12128258

https://www.hs.fi/talous/art-2000008309670.html

https://www.iltalehti.fi/digiuutiset/a/e9d571df-f2b7-48d7-87e6-5836f0425624

https://www.is.fi/digitoday/art-2000008309646.html

10 Comments

  1. Tomi Engdahl says:

    With both BGP and DNS offline, many of the tools and techniques engineers would use to troubleshoot and fix the problem were also unavailable. Humorously, even physical access controls were affected, meaning that FB engineers were locked out of the very datacenters they needed to access to resolve the problem.

    Cloudflare has some interesting insights from their 1.1.1.1 DNS resolver. Namely, when Facebook.com stopped responding, DNS traffic exploded, and global DNS queries for Facebook multiplied thirty-fold. If other domains were timing out or acting strange, it was probably because of that unintentional DDoS on DNS. What caused it? Too many applications written without error handling for facebook.com’s disappearance.

    Source: https://hackaday.com/2021/10/08/this-week-in-security-apache-nightmare-revil-arrests-and-the-ultimate-rickroll/
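The "unintentional DDoS" above comes from clients retrying immediately and endlessly. One common mitigation (my suggestion, not something the cited article prescribes) is to retry with exponential backoff and random jitter. A minimal sketch, where `call_with_backoff` and its parameters are names made up for this example:

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on OSError with exponential backoff and jitter.

    The wait ceiling doubles after every failure (capped at max_delay) and the
    actual wait is a random fraction of that ceiling, so many clients failing
    at the same moment do not all retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller handle the failure
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)
```

A DNS lookup could be wrapped as `call_with_backoff(lambda: socket.getaddrinfo("www.example.com", 443))` (with `import socket`), since `socket.gaierror` is a subclass of `OSError`. The `sleep` and `rng` parameters exist only so the behaviour can be tested without real waiting.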

  2. Tomi Engdahl says:

    Karissa Bell / Engadget:
    Instagram says it is testing Activity Feed notifications in the US that will alert users of service outages and technical issues — One week after a massive Facebook outage that took all of the social network’s apps offline for more than six hours, Instagram says it’s testing notifications …

    Instagram is testing in-app notifications for service outages
    The app is also adding a new ‘account status’ feature for personalized notifications.
    https://www.engadget.com/instagram-test-notifications-outages-account-status-210940030.html

  3. Tomi Engdahl says:

    THE TWO-NAPKIN PROTOCOL
    https://computerhistory.org/blog/the-two-napkin-protocol/

    It was 1989. Kirk Lougheed of Cisco and Yakov Rekhter of IBM were having lunch in a meeting hall cafeteria at an Internet Engineering Task Force (IETF) conference.

    They wrote a new routing protocol that became RFC (Request for Comment) 1105, the Border Gateway Protocol (BGP), known to many as the “Two Napkin Protocol” — in reference to the napkins they used to capture their thoughts.

  4. Tomi Engdahl says:

    Why does the internet keep breaking?
    https://www.bbc.com/news/business-58873472

    I doubt Mark Zuckerberg reads the comments people leave on his Facebook posts.

    But, if he did, it would take him approximately 145 days, without sleep, to wade through the deluge of comments left for him after he apologised for the meltdown of services last week.

    “Sorry for the disruption today” the Facebook founder and chief executive posted, following almost six hours of Facebook, WhatsApp and Instagram being offline.

    Facebook blamed a routine maintenance job for the disruption – its engineers had issued a command that unintentionally disconnected Facebook data centres from the wider internet.

    Around 827,000 people responded to Mr Zuckerberg’s apology.

    The messages ranged from the amused: “It was terrible, I had to talk to my family,” commented one Italian user, to the confused: “I took my phone into the repair shop thinking it was broken,” wrote someone from Namibia.

    And, of course, the very upset and angry: “You cannot have everything shut down at the same time. The impact is unprecedented,” one Nigerian businessman posted. Another from India asked for compensation for the disruption to their business.

    What is clear now, if it wasn’t obvious already, is just how reliant billions of people have become on these services – not just for fun but also for essential communication and trading.

    Many businesses now rely heavily on Facebook services like WhatsApp and Instagram to stay in touch with customers

    What is also clear is that this is far from being a one-off situation: experts suggest widespread outages are becoming more frequent and more disruptive.

    “One of the things that we’ve seen in the last several years is an increased reliance on a small number of networks and companies to deliver large portions of Internet content,” says Luke Deryckx, chief technical officer at Downdetector.

    “When one of those, or more than one, has a problem, it affects not just them, but hundreds of thousands of other services,” he says. Facebook, for instance, is now used to sign-in to a range of different services and devices, such as smart televisions.

    “And so, you know, we have these sort of internet ‘snow days’ that happen now,” Mr Deryckx says. “Something goes down [and] we all sort of look at each other like ‘well, what are we going to do?’”

    “When Facebook has a problem, it creates such a big impact for the internet but also the economy, and, you know… society. Millions, or potentially hundreds of millions, of people are just sort of sitting around waiting for a small team in California to fix something. It’s an interesting phenomena that has grown in the last couple of years.”

    Inevitably, at some stage, during a large outage of services, people worry that the disruption is the result of some sort of cyber-attack.

    But experts suggest, more often than not, it’s down to a more mundane case of human error, compounded, they say, by the way the internet is held together with a complex set of outdated and fiddly systems.

    Internet scientist Professor Bill Buchanan agrees with this characterisation: “The internet isn’t the large-scale distributed network that DARPA (the Defense Advanced Research Projects Agency), the original architects of the internet, tried to create, which could withstand a nuclear-strike on any part of it.

    “The protocols it uses are basically just the ones that were drafted when we connected to mainframe computers from dumb terminals. A single glitch in its core infrastructure can bring the whole thing crashing to the floor.”

    Many of the fundamentals of the net are here to stay for better or worse.

    “In general, the systems work and you can’t just switch certain protocols of the internet ‘off’ for a day, to try to remake them,”

    Instead of trying to rebuild the systems and structure of the internet, Professor Buchanan thinks we need to improve the way we use it to store and share data, or risk more mass outages in the future.

    He argues that the internet has become too centralised, i.e. where too much data comes from a single source. That trend needs to be reversed with systems that have multiple nodes, he explains, so that no one failure can stop a service from working.

  5. Tomi Engdahl says:

    Why does the internet keep breaking?
    https://www.bbc.com/news/business-58873472
    I doubt Mark Zuckerberg reads the comments people leave on his Facebook posts. But, if he did, it would take him approximately 145 days, without sleep, to wade through the deluge of comments left for him after he apologised for the meltdown of services last week. “Sorry for the disruption today” the Facebook founder and chief executive posted, following almost six hours of Facebook, WhatsApp and Instagram being offline. Facebook blamed a routine maintenance job for the disruption – its engineers had issued a command that unintentionally disconnected Facebook data centres from the wider internet.

  6. Tomi Engdahl says:

    How Quantum Computers Can Impact Security https://www.trendmicro.com/en_us/research/21/j/how-quantum-computers-can-impact-security.html
    If you’ve been following technology trends over the past few years, you’ve no doubt heard of the term quantum computing, which many call the next frontier for computing technologies. The promise of a computer that, on paper, has the potential to surpass the capabilities of even today’s fastest supercomputers has many players in the tech industry excited, leading to many new startups focusing their efforts on the quantum computing field.

  7. Tomi Engdahl says:

    This AWS outage is no joke.

    Azure VMs are down in some regions today. Can’t start them if they’re off.

  8. Tomi Engdahl says:

    Why was Facebook down for five hours?
    https://www.youtube.com/watch?v=-wMU8vmfaYo

    Facebook was down for five hours last week. What happened and what do DNS and BGP have to do with it?

    0:00 DNS
    7:13 Caching DNS
    10:34 Hop-by-hop routing
    14:07 Default-free routing
    18:28 Peering
    19:50 BGP
    26:08 The outage

    Facebook’s explanation:
    More details about the October 4 outage
    https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

  9. daisygosia says:

    It really looks great, I’ve seen many other posts, that’s the info I needed, thanks for sharing.

