zlacker

[return to "Cloudflare outage on December 5, 2025"]
1. w10-1+aw[view] [source] 2025-12-05 17:47:25
>>meetpa+(OP)
Kudos to Cloudflare for clarity and diligence.

When talking of their earlier Lua code:

> we have never before applied a killswitch to a rule with an action of “execute”.

I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?

It tracks what I've seen elsewhere: quality engineering can't keep up with production engineering. It's just that I think of Cloudflare as an infrastructure place, where that shouldn't be true.

I had a manager who came from defense electronics in the 1980s. He said that in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.

◧◩
2. braiam+qD[view] [source] 2025-12-05 18:20:24
>>w10-1+aw
This is funny, considering that someone who worked in the defense industry (on a guided missile system) found a memory leak in one of their products at that time. They told him that they knew about it, but that it was timed just right for the range over which the system would be used, so it didn't matter.
◧◩◪
3. Ethery+GK[view] [source] 2025-12-05 18:51:20
>>braiam+qD
This paraphrased urban legend has nothing to do with quality engineering though? As described, it's designed to the spec and working as intended.
◧◩◪◨
4. mikkup+ql1[view] [source] 2025-12-05 21:47:19
>>Ethery+GK
It tracks with my experience in software quality engineering. Asked to find problems with something already working well in the field. Dutifully find bugs/etc. Get told that it's working though so nobody will change anything. In dysfunctional companies, which is probably most of them, quality engineering exists to cover asses, not to actually guide development.
◧◩◪◨⬒
5. colech+Ru1[view] [source] 2025-12-05 22:47:36
>>mikkup+ql1
It is not dysfunctional to ignore unreachable "bugs". A memory leak on a missile which won't be reached because it will explode long before that amount of time has passed is not a bug.
◧◩◪◨⬒⬓
6. wkat42+ww1[view] [source] 2025-12-05 22:58:21
>>colech+Ru1
It's a debt though. Because people will forget it's there and then at some point someone changes a counter from milliseconds to microseconds and then the issue happens 1000 times sooner.

It's never right to leave structural issues even if "they don't happen under normal conditions".
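
To put rough numbers on that, here is a minimal sketch of the arithmetic. Every figure in it (memory budget, bytes leaked per tick, maximum flight time) is a made-up assumption for illustration, not a detail of any real system:

```python
def seconds_until_exhaustion(free_bytes, leak_bytes_per_tick, ticks_per_second):
    """Seconds until a steady per-tick leak consumes all free memory."""
    return free_bytes / (leak_bytes_per_tick * ticks_per_second)


def leak_is_reachable(exhaustion_seconds, max_flight_seconds):
    """True if memory runs out before the longest possible flight ends."""
    return exhaustion_seconds < max_flight_seconds


# Hypothetical figures, chosen only to illustrate the point.
FREE_BYTES = 64 * 1024 * 1024   # assumed 64 MiB of headroom
LEAK_PER_TICK = 16              # assumed bytes leaked on every counter tick
MAX_FLIGHT_SECONDS = 600        # assumed longest possible flight

# Original design: the counter ticks once per millisecond.
ms_exhaustion = seconds_until_exhaustion(FREE_BYTES, LEAK_PER_TICK, 1_000)

# After the change: the same counter ticks once per microsecond.
us_exhaustion = seconds_until_exhaustion(FREE_BYTES, LEAK_PER_TICK, 1_000_000)

print(leak_is_reachable(ms_exhaustion, MAX_FLIGHT_SECONDS))
print(leak_is_reachable(us_exhaustion, MAX_FLIGHT_SECONDS))
```

With these assumed numbers the leak outlasts the flight at millisecond ticks, but at microsecond ticks memory runs out well inside the flight window. The safety margin was a property of the tick rate, not of the code that leaks.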

◧◩◪◨⬒⬓⬔
7. Ethery+Jz1[view] [source] 2025-12-05 23:20:31
>>wkat42+ww1
I don't think this argument makes sense. You wouldn't provision a 100GB server for a service where 1GB would do, just in case unexpected conditions come up. If the requirements change, then the setup can change; doing it just because is wasteful. "What if we forget?" is not a valid argument for over-engineering and over-provisioning.
◧◩◪◨⬒⬓⬔⧯
8. datadr+LD1[view] [source] 2025-12-05 23:51:17
>>Ethery+Jz1
If a fix is relatively low cost and improves the software in a way that makes it easier to modify in the future, it makes it easier to change the requirements. In aggregate these pay off.
◧◩◪◨⬒⬓⬔⧯▣
9. TOMDM+GF1[view] [source] 2025-12-06 00:07:44
>>datadr+LD1
This is all relative though.

If a missile passes the long hurdles and hoops built into modern Defence T&E procurement it will only ever be considered out of spec once it fails.

For a good portion of platforms they will go into service, be used for a decade or longer, and not once will the design be modified before going end of life and replaced.

If you wanted to progressively iterate or improve on these platforms, then yes continual updates and investing in the eradication of tech debt is well worth the cost.

If you're strapping explosives attached to a rocket engine to your vehicle and pointing it at someone, there is merit in knowing it will behave exactly the same way it has done the past 1000 times.

Neither ethos in modifying a system is necessarily wrong, but you do have to choose which you're going with, and what the merits and drawbacks of that are.
