Cloudflare's proxy service has limits to prevent excessive memory consumption, and the bot management system has a “limit on the number of machine learning features that can be used at runtime.” This limit is 200, well above the number of features actually in use.
“When a bad file with over 200 features was distributed to our servers, this limit was exceeded, causing the system to panic” and throw errors, Prince wrote.
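The failure mode described above can be illustrated with a minimal Python sketch. It is not Cloudflare's code: the names `MAX_FEATURES`, `load_feature_file`, and `refresh` are hypothetical, and the only detail taken from the article is the limit of 200. The sketch contrasts a loader that rejects an oversized file outright with a caller that keeps the last known good file instead of crashing.

```python
MAX_FEATURES = 200  # the limit quoted in the article; everything else is assumed


class FeatureLimitExceeded(Exception):
    """Raised when a feature file has more entries than the runtime allows."""


def load_feature_file(entries, limit=MAX_FEATURES):
    """Return a copy of the entries, or raise if the file exceeds the limit."""
    if len(entries) > limit:
        raise FeatureLimitExceeded(
            f"{len(entries)} entries exceeds the limit of {limit}"
        )
    return list(entries)


def refresh(current, candidate, limit=MAX_FEATURES):
    """Adopt the candidate file if it is valid; otherwise keep the current one."""
    try:
        return load_feature_file(candidate, limit)
    except FeatureLimitExceeded:
        # Hypothetical graceful path: reject the bad file, keep serving traffic.
        return current
```

In this sketch, an unhandled `FeatureLimitExceeded` plays the role of the panic Prince describes, while `refresh` shows one way a caller could fall back to a known good file.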
Worst Cloudflare outage since 2019
The number of HTTP 5xx error status codes served by Cloudflare's network is usually “very low” but increased sharply after the bad file spread across the network. “The spike and subsequent fluctuations indicate that our system is crashing due to loading the wrong feature file,” Prince wrote. “What's remarkable is that our system then took some time to recover. This was very unusual behavior for an internal error.”
This unusual behavior was explained by the fact that “the file was created every five minutes by a query running on the ClickHouse database cluster, which was being gradually updated to improve permission management,” Prince wrote. “Bad data was only generated if a query was executed on a part of the cluster that was being upgraded. As a result, every five minutes there was a chance that a good or bad set of configuration files would be generated and quickly spread across the network.”
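The intermittent pattern can be sketched as a toy simulation. This is an illustration under stated assumptions, not a model of Cloudflare's actual cluster: the function `simulate`, the node counts, and the rule that one randomly chosen node runs the query each cycle are all hypothetical. Each cycle stands for one five-minute generation run, and the file is bad whenever the chosen node has already been upgraded.

```python
import random


def simulate(total_nodes, upgraded_per_cycle, cycles, seed=0):
    """Return a list of 'good'/'bad' outcomes, one per generation cycle.

    Assumption: nodes are upgraded in order, a fixed number per cycle, and
    each cycle's query lands on one uniformly random node.
    """
    rng = random.Random(seed)
    upgraded = 0
    outcomes = []
    for _ in range(cycles):
        # The rollout progresses a little before each generation run.
        upgraded = min(total_nodes, upgraded + upgraded_per_cycle)
        node = rng.randrange(total_nodes)
        # Nodes numbered below `upgraded` have the change and emit a bad file.
        outcomes.append("bad" if node < upgraded else "good")
    return outcomes


if __name__ == "__main__":
    print(simulate(total_nodes=10, upgraded_per_cycle=1, cycles=15))
```

Early cycles can return either outcome, which matches the fluctuations described; once every node is upgraded, every cycle produces a bad file and the output stays bad.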
Initially, these fluctuations “led us to believe that this could be caused by an attack. Ultimately, each ClickHouse node generated an incorrect configuration file, and the fluctuations stabilized in a crashed state,” he wrote.
Prince said Cloudflare “resolved the issue by stopping the generation and distribution of the bad feature file and manually inserting a known good file into the feature file distribution queue” and then “forcing a restart of our primary proxy.” The team then worked to “restart the remaining services that had entered a bad state” until the 5xx error code volume returned to normal later that day.
Prince said the outage was the worst at Cloudflare since 2019 and that the firm is taking steps to protect against similar failures in the future. Cloudflare will work to “improve the acceptance of Cloudflare-generated configuration files in the same way we do for user-generated input; providing more global kill switches for features; eliminating the possibility of core dumps or other error reports that overload system resources; [and] analysis of failure modes for errors in all major proxy modules,” Prince said.
While Prince can't promise that Cloudflare will never experience another outage of this magnitude, he said previous outages have “always led to the creation of new, more resilient systems.”





