Cloudflare co-founder and CEO Matthew Prince described the outage on Tuesday 18 November, which disrupted global internet traffic for several hours, as the company's worst since 2019, stating that the traffic management giant had not encountered an issue that caused most of its core traffic to stop flowing through its network in more than six years.
“Outages like today are unacceptable. We have designed our systems to be very resilient to failures to ensure a constant flow of traffic. When we have experienced failures in the past, it has always resulted in new, more resilient systems being built,” Prince said. “On behalf of the entire Cloudflare team, I would like to apologize for the pain we caused the Internet today.”
The Cloudflare outage began at 11:20 UTC (6:20 a.m. EST) on Tuesday, when its network started experiencing significant failures in delivering core traffic, which ordinary web users saw as an error page indicating a failure within Cloudflare's network when they tried to reach a customer site. The issue was not caused by a cyber attack or malicious activity, but by a change affecting a file used by Cloudflare's Bot Management security system.
Cloudflare Bot Management includes a machine learning model that generates bot "scores" for every request passing through the network. These scores are used by customers to allow or block bot access to their sites. The model relies on a feature configuration file that it uses to predict whether a request is automated, and because the bot landscape is so dynamic, the file is refreshed and deployed every few minutes so Cloudflare can respond to new bots and new attacks.
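To make the mechanism concrete, the sketch below shows, in minimal Python, how a per-request bot score might be derived from such a feature configuration file. The file layout, field names, weights and blocking threshold are all illustrative assumptions, not Cloudflare's actual implementation.

```python
# Hypothetical sketch of how a per-request bot score might be applied.
# Names, fields and the threshold are illustrative assumptions, not
# Cloudflare's actual implementation.
import json

def load_feature_config(path: str) -> dict:
    """Read the frequently refreshed feature configuration file."""
    with open(path) as f:
        return json.load(f)

def bot_score(request_features: dict, config: dict) -> float:
    """Combine request features with model weights into a 0-100 score
    (lower = more likely automated)."""
    score = config.get("bias", 50.0)
    for name, weight in config.get("weights", {}).items():
        score += weight * request_features.get(name, 0.0)
    return max(0.0, min(100.0, score))

def allow_request(request_features: dict, config: dict, threshold: float = 30.0) -> bool:
    """Customers typically block requests scoring below a chosen threshold."""
    return bot_score(request_features, config) >= threshold
```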
The failure stemmed from a change to the permissions on one of Cloudflare's database systems, which caused the database to output duplicate entries into the feature configuration file. The file rapidly grew in size and was then distributed to all the machines that make up the Cloudflare network. These machines, which route traffic across the network, read the file to keep the bot management system up to date, but because their software imposes a limit on the size of the feature file, they failed when a larger-than-expected file arrived, causing them to crash.
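The role of that size limit can be illustrated with a short, purely hypothetical Python sketch: a hard cap on the number of feature entries turns an unexpectedly large (here, duplicated) file into a crash rather than a graceful rejection. The limit value and data layout are assumptions.

```python
# Illustrative only: shows how a hard-coded cap on configuration size can
# turn an unexpectedly large feature file into a crash rather than a
# graceful rejection. The limit value and structure are assumptions.
MAX_FEATURES = 200  # assumed preallocated limit in the consuming software

def load_features(entries: list[dict]) -> list[dict]:
    if len(entries) > MAX_FEATURES:
        # A failure here propagates to every machine that loads the file.
        raise RuntimeError(
            f"feature file has {len(entries)} entries, exceeds limit {MAX_FEATURES}"
        )
    return entries

# Duplicate rows emitted by the database query roughly double the entry
# count, tripping the limit on every machine the file reaches.
good = [{"name": f"feat_{i}"} for i in range(120)]
bad = good + good            # duplicated entries push the file past the cap
load_features(good)          # fine
# load_features(bad)         # raises, analogous to the machines crashing
```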
DDoS confusion
Prince said Cloudflare's technical teams initially suspected they were seeing a hyper-scale distributed denial of service (DDoS) attack for two reasons. First, by pure coincidence, Cloudflare's status page, which is hosted entirely outside its own infrastructure and has no dependencies on it, went down at the same time. Second, early in the outage there were brief periods in which the systems appeared to recover.
However, this was not the work of an attacker; rather, it was because the feature file was being regenerated every five minutes by a query running on a ClickHouse database cluster, which was itself being gradually upgraded to improve permissions management.
The bad file was only produced when the query ran on an already-updated part of the cluster, so every five minutes there was a chance that either a good or a bad set of configuration files would be generated and pushed out across the network.
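A rough simulation of that flapping behaviour, again an assumed Python sketch rather than anything resembling Cloudflare's tooling, looks like this: each regeneration cycle produces either a good or a bad file depending on which part of the partially upgraded cluster serves the query.

```python
# Rough simulation of the observed flapping: the feature file is rebuilt
# every few minutes, and whether it comes out good or bad depends on which
# part of the partially upgraded database cluster answers the query.
# Entirely illustrative; node counts and timing are assumptions.
import random

UPGRADED_FRACTION = 0.5  # assumed share of cluster nodes already upgraded

def regenerate_feature_file() -> str:
    node_upgraded = random.random() < UPGRADED_FRACTION
    # Only queries served by upgraded nodes return the duplicated rows
    # that inflate the file past the size limit.
    return "bad" if node_upgraded else "good"

# Each cycle either restores service or knocks it over again.
history = [regenerate_feature_file() for _ in range(10)]
print(history)  # e.g. ['good', 'bad', 'bad', 'good', ...]
```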
“These fluctuations made it unclear what was going on, as the entire system would recover and then fail again as sometimes good and sometimes bad configuration files were propagated into our network,” Prince said. “Initially, this led us to believe that this might have been caused by an attack. Ultimately, every ClickHouse node was generating the incorrect configuration file, and the fluctuations stabilized in the failing state.”
These errors continued until the technical team was able to identify the problem and resolve it by stopping the generation and propagation of the bad feature file, manually inserting a known-good file into the distribution queue, and then forcing a restart of the core proxy. After this, things began to return to normal from 14:30 UTC, and error rates across the Cloudflare network were back to baseline roughly two and a half hours later.
Risk and resilience
While Cloudflare itself was not attacked, the outage is still a significant cyber risk issue, and there are lessons to be learned not just at Cloudflare but across organizations of all kinds, whether they are customers or not. It exposed a deeper systemic risk: too much of the Internet's infrastructure rests on too few shoulders.
Ryan Polk, director of policy at the US non-profit the Internet Society, said that market concentration among content delivery networks (CDNs) has been steadily increasing since 2020: “CDNs offer clear benefits – they improve reliability, reduce latency and reduce backhaul demand. However, when too much Internet traffic is concentrated in the hands of a few providers, these networks can become single points of failure that disrupt access to large parts of the Internet.”
“Organizations must evaluate the resilience of the services they rely on and examine their supply chains. Which systems and vendors are critical to their operations? Where are single points of failure? Companies must explore ways to diversify, such as using multiple cloud, CDN, or authentication providers, to reduce risk and improve overall resilience.”
Martin Greenfield, CEO of continuous controls monitoring platform Quod Orbis, added: “When a single auto-generated configuration file can take down a large portion of the Internet, this is not just a Cloudflare problem, but a fragility problem that has become baked into how organizations build their security stacks.
“Automation makes security scalable, but when an automated configuration is propagated instantly across a global network, it also scales failure. What is missing in most organizations, and was clearly missing here, is automated validation that checks those configurations before they go live. Automation without verification is fragility at scale, and relying on a single vendor is not an effective resilience strategy.”
For his part, Prince said Cloudflare will take steps to reduce the likelihood of such an issue recurring. These include hardening the ingestion of Cloudflare-generated configuration files in the same way as it would for user-generated input, enabling more global kill switches for features, eliminating the ability of core dumps or other error reports to overwhelm system resources, and reviewing failure modes for error conditions across all core proxy modules.
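The first of those measures, treating internally generated configuration as if it were untrusted input, might look something like the following Python sketch. The schema checks, the size limit and the kill-switch mechanism are assumptions made for illustration, not details of Cloudflare's design.

```python
# Illustrative sketch of treating an internally generated configuration file
# like untrusted input, with a kill switch as a fallback. Limits, schema and
# the kill-switch mechanism are assumptions, not Cloudflare's design.
import json

MAX_FEATURES = 200             # assumed hard limit, enforced before deployment
BOT_MANAGEMENT_ENABLED = True  # assumed global kill switch for the feature

def validate_feature_config(raw: str) -> list[dict]:
    """Reject malformed or oversized configs instead of letting them crash
    the machines that consume them."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("feature config must be a list of entries")
    if len(data) > MAX_FEATURES:
        raise ValueError(f"{len(data)} entries exceeds limit of {MAX_FEATURES}")
    names = [entry.get("name") for entry in data]
    if len(names) != len(set(names)):
        raise ValueError("duplicate feature entries detected")
    return data

def apply_config(raw: str, current: list[dict]) -> list[dict]:
    """Keep serving with the last good config (or disable the module via the
    kill switch) rather than crashing on a bad file."""
    if not BOT_MANAGEMENT_ENABLED:
        return []  # kill switch: run without bot scoring rather than fail
    try:
        return validate_feature_config(raw)
    except ValueError:
        return current  # fall back to the last known-good configuration
```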