Facebook failure exposed weak architecture – Computerworld

Posted On: October 13, 2021
Posted By: Winston Ferguson

Facebook says the main cause of the outage was human error during routine system maintenance. As a result, the Internet's DNS system was for a time unable to return the IP addresses of the servers behind the services of the world's largest social network.

Because of the failure, IT staff could not remotely access the network management devices that would have let them restart the entire system right away. They had to reach the equipment in person to restart it manually, and that took a long time. Restoring service was not simple either, especially since it required coordinating operations performed simultaneously in many data centers.

This was compounded by the fact that individual data centers reported huge drops in power consumption, in the range of tens of megawatts, and a sudden reversal of that trend could have caused further problems.

Keep in mind that data centers have security measures that make them hard for unauthorized people to tamper with. They are difficult to enter, and once inside, the hardware is designed so that any modification requires authentication. Physical access to a router is not enough on its own. That is why Facebook was down for just over seven hours in total, the longest outage in its history so far.

Further analysis of the event suggests that the entire DNS system went offline because of a malfunction in software developed by Facebook, which responds automatically to various problems occurring across the backbone network.

As is well known, Internet DNS translates the names of websites and servers into IP addresses. Facebook has built its own system to run its internal network. Its architecture scales the DNS service up or down depending on the availability of DNS servers. When the Internet-facing DNS servers became unavailable because of the failure, the corporate system removed them from its tables.

The point is that these tables are used by the Border Gateway Protocol (BGP), which knows the routes to computers with specific IP addresses. These routes are sent to routers regularly so that they have up-to-date information on how to direct traffic. Once the company's DNS system removed the information used by BGP from its tables, the backbone lost access to it, making it impossible for the entire Internet to find Facebook's servers, a Facebook engineer explained.
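
The mechanism described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Facebook's actual software: the class `ToyBackbone`, the site names, and the prefixes (taken from the address ranges reserved for documentation) are all hypothetical. It shows how an automated health check that removes every failing DNS site from the advertised routes can leave nothing reachable when all sites fail at once.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToyBackbone:
    """Toy model: a DNS prefix stays advertised only while its site is healthy.

    All names here are illustrative assumptions, not a real routing API.
    """

    # Maps an advertised prefix to the (hypothetical) DNS site announcing it.
    advertised: Dict[str, str] = field(default_factory=dict)

    def advertise(self, prefix: str, site: str) -> None:
        """Announce a prefix on behalf of a DNS site."""
        self.advertised[prefix] = site

    def apply_health_check(self, healthy_sites: Set[str]) -> List[str]:
        """Withdraw every prefix whose site is not healthy; return what was withdrawn."""
        withdrawn = [
            prefix
            for prefix, site in self.advertised.items()
            if site not in healthy_sites
        ]
        for prefix in withdrawn:
            del self.advertised[prefix]
        return withdrawn

    def reachable(self, prefix: str) -> bool:
        """A prefix can be found from outside only while it is still advertised."""
        return prefix in self.advertised
```

If two sites are advertised and the health check is then run with an empty set of healthy sites, both prefixes are withdrawn and `reachable` returns `False` for each of them, even though nothing in the model says the servers themselves stopped working.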

So what is the lesson from this failure? According to specialists, the blame lies with Facebook's specific architecture, which allowed a failure of the internal DNS system to bring down the entire Facebook network. One solution may be a redundant or backup DNS system that would kick in at critical moments and take over the tasks of the primary system.

Facebook should therefore consider duplicating its DNS service to avoid such outages in the future. It can draw on the experience of other companies that already use similar solutions. One example is Amazon, whose AWS network has its own DNS system while also using two redundant external services of this type (Dyn and UltraDNS).
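
The idea of a backup DNS service taking over from a failed primary can be sketched in a few lines. This is a minimal illustration, not a description of how Amazon or any other provider implements redundancy. The function `resolve_with_fallback` and the helper resolvers are hypothetical names, and the resolvers are plain Python callables rather than real DNS clients, so the example runs without network access.

```python
from typing import Callable, Dict, Optional, Sequence

# A resolver takes a host name and returns an IP address, or None if unknown.
# It may raise OSError to signal that the service itself is unreachable.
Resolver = Callable[[str], Optional[str]]


def resolve_with_fallback(name: str, resolvers: Sequence[Resolver]) -> Optional[str]:
    """Try each resolver in order and return the first address obtained."""
    for resolver in resolvers:
        try:
            address = resolver(name)
        except OSError:
            # This DNS service is down; move on to the next one.
            continue
        if address is not None:
            return address
    return None


def unreachable_resolver(name: str) -> Optional[str]:
    """Hypothetical primary service that is currently offline."""
    raise OSError("primary DNS service unreachable")


def make_static_resolver(table: Dict[str, str]) -> Resolver:
    """Hypothetical backup service answering from a fixed table."""
    def lookup(name: str) -> Optional[str]:
        return table.get(name)
    return lookup
```

With `unreachable_resolver` listed first and a static backup second, a lookup still succeeds as long as the backup knows the name. The trade-off is that the backup's records have to be kept in sync with the primary's, which this sketch does not address.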