You're changing the subject here and shifting focus from the specific to the vague. The two postmortems after the recent major Cloudflare outages both listed straightforward errors in source code that could have been tested and detected.
Theoretical outages could theoretically have other causes, but these two specific outages had specific causes that we know.
> which is why robust and fast rollback procedures are usually desirable and implemented.
Yes, nobody is arguing against that. It's a red herring with regard to my point about source code testing.
With all due respect, it sounds like you have not worked on these types of systems, but out of curiosity - what type of test do you think would have prevented this?
Cloudflare states that the compiler would prevent the bug in certain programming languages. So it seems ridiculous to suggest that the bug can't be detected outside the scale of a larger system.
2024 revenue figures were $1.669 billion for Cloudflare, and $3.99 billion for Akamai, per Wikipedia.
if rule_result.action == "execute" then
  rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end
"This code expects that, if the ruleset has action="execute", the "rule_result.execute" object will exist. However, because the rule had been skipped, the rule_result.execute object did not exist, and Lua returned an error due to attempting to look up a value in a nil value. This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur."
The unit tests are for the source code. In this respect, the number of requests a second fielded by the system is irrelevant. Unit tests don't happen in production; that's the point of them.
It's a classic coding mistake, failing to check for nil, and none of your handwaving about "scale" changes that fact.
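To make that concrete, here is a minimal sketch of the kind of unit test I mean. I have no idea what Cloudflare's actual test suite or surrounding code looks like, so the function names (attach_results_unguarded, attach_results_guarded) and the fixture shapes are my own assumptions. Only the if-block is taken from the snippet quoted above. The test feeds in a rule with action "execute" but no execute table, which is the skipped-rule case the postmortem describes.

```lua
-- Hypothetical wrapper around the quoted snippet; the function name is an assumption.
local function attach_results_unguarded(rule_result, ruleset_results)
  if rule_result.action == "execute" then
    rule_result.execute.results =
      ruleset_results[tonumber(rule_result.execute.results_index)]
  end
  return rule_result
end

-- Same logic with an explicit nil check (one possible fix, not Cloudflare's).
local function attach_results_guarded(rule_result, ruleset_results)
  if rule_result.action == "execute" and rule_result.execute ~= nil then
    rule_result.execute.results =
      ruleset_results[tonumber(rule_result.execute.results_index)]
  end
  return rule_result
end

-- Assumed fixture: results of two sub-rulesets.
local ruleset_results = { "first", "second" }

-- Case 1: an "execute" rule that was skipped, so it has no execute table.
-- pcall catches the "attempt to index a nil value" error instead of crashing.
local ok_unguarded = pcall(attach_results_unguarded,
                           { action = "execute" }, ruleset_results)
assert(ok_unguarded == false, "unguarded version should fail on a skipped rule")

local ok_guarded = pcall(attach_results_guarded,
                         { action = "execute" }, ruleset_results)
assert(ok_guarded == true, "guarded version should tolerate a skipped rule")

-- Case 2: a normal "execute" rule still gets its results attached.
local normal = attach_results_guarded(
  { action = "execute", execute = { results_index = "2" } }, ruleset_results)
assert(normal.execute.results == "second")
```

Nothing here depends on request volume or the rest of the proxy. It runs in a plain Lua interpreter, and the first assertion is exactly the failure mode from the postmortem.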
That does not sound right to me. "20 percent of websites" does not mean "20 percent of traffic."
There is no public write-up from Cloudflare that proves “we handle 20% of all Internet traffic.” Cloudflare reports around 295,000 paying customers and more than 30 million Internet properties (20% of the web). So most of their users are on the free plan.
I can't believe they only have 295,000 paying customers. That puts me in a small minority, lol.