Fastly is a cloud computing service provider whose network is described as an edge cloud platform. The platform is designed to help developers extend their core cloud infrastructure to the network edge, closer to their users. After restoring access to the many affected sites, the American provider explained the cause of the failure.
Fastly suffered an outage that took down many news websites around the world. According to the company, the incident was caused by a software bug that was triggered by a settings change on its platform.
The incident raises questions about how heavily the Internet depends on a small number of infrastructure providers. On the day of the outage, many sites were unreachable. These included the websites of Bloomberg News, CNN, the Financial Times, the Guardian and the New York Times. They were not the only ones. Several other high-traffic services were also inaccessible that day, including Amazon, PayPal, Reddit and Spotify.
The outage at Fastly was both widespread and serious. The company said it regretted the impact the incident had on its customers' operations, and it apologized to them and to everyone who relies on them. It also acknowledged that it should have anticipated the problem.
Fastly operates a network of servers that store copies of its customers' content. These servers are deliberately located close to end users, so that content can be delivered to them more quickly and securely than if every request had to travel to the customer's own servers.
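The caching idea described above can be illustrated with a minimal sketch. This is a toy model, not Fastly's implementation: the `EdgeCache` class, its fixed time-to-live, and the `fetch_from_origin` callback are all hypothetical names introduced here for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Toy model of a single edge server (hypothetical, for illustration only).

    It keeps a copy of each response for a fixed time-to-live (TTL) and only
    contacts the customer's origin server on a miss or after expiry.
    """

    def __init__(
        self,
        fetch_from_origin: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_from_origin   # slow path: ask the origin server
        self._ttl = ttl_seconds           # how long a cached copy stays fresh
        self._clock = clock               # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, str]] = {}  # path -> (expiry, body)
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and entry[0] > now:
            # Fresh copy available at the edge: no trip to the origin.
            self.hits += 1
            return entry[1]
        # Missing or expired: fetch from the origin and cache the result.
        self.misses += 1
        body = self._fetch(path)
        self._store[path] = (now + self._ttl, body)
        return body
```

In this sketch, the first request for a path is a miss and goes to the origin, while repeated requests within the TTL are served from the edge copy. If the edge layer itself starts returning errors, as happened in this incident, every site behind it is affected at once, even when the origin servers are healthy.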
The cause of the failure is now known. On May 12, Fastly deployed a software update that contained a bug. The bug remained dormant until a customer changed its settings, which triggered it.
It was this configuration change that caused 85% of the company's network to return errors.
The engineers called in to analyze the situation were able to pinpoint the cause of the failure. They then disabled the offending configuration, after which most of the network returned to normal.
Specifically, within 49 minutes, 95% of the network was operating normally again. By 12:35 GMT, the network was fully operational. A corrective software fix then began to be deployed at 17:25.