zlacker

[return to "Cloudflare outage on December 5, 2025"]
1. w10-1+aw 2025-12-05 17:47:25
>>meetpa+(OP)
Kudos to Cloudflare for clarity and diligence.

When talking of their earlier Lua code:

> we have never before applied a killswitch to a rule with an action of “execute”.

I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?

It tracks what I've seen elsewhere: quality engineering can't keep up with the production engineering. It's just that I think of Cloudflare as an infrastructure place, where that shouldn't be true.

I had a manager who came from defense electronics in the 1980s. He said in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.

◧◩
2. braiam+qD 2025-12-05 18:20:24
>>w10-1+aw
This is funny, considering that someone who worked in the defense industry (guided missile system) found a memory leak in one of their products, at that time. They told him that they knew about it, but that it's timed just right with the range of the system it would be used in, so it doesn't matter.
◧◩◪
3. mopsi+9K 2025-12-05 18:49:22
>>braiam+qD
... until the extended-range version is ordered and no one remembers to fix the leak. :]
◧◩◪◨
4. wizzwi+TU 2025-12-05 19:33:49
>>mopsi+9K
They will remember, because it'll have been measured and documented, rigorously.
◧◩◪◨⬒
5. Sketch+IW 2025-12-05 19:43:40
>>wizzwi+TU
I've found that the real trick with documentation isn't creation, it's discovery. I wonder how that information is easily found afterwards.
◧◩◪◨⬒⬓
6. wizzwi+Ca1 2025-12-05 20:54:11
>>Sketch+IW
For the new system to be approved, you need to document the properties of the software component that are deemed relevant. The software system uses dynamic allocation, so "what do the allocation patterns look like? are there leaks, risks of fragmentation, etc, and how do we characterise those?" is on the checklist. The new developer could try to figure this all out from scratch, but if they're copying the old system's code, they're most likely just going to copy the existing paperwork, with a cursory check to verify that their modifications haven't changed the properties.

They're going to see "oh, it leaks 3MiB per minute… and this system runs for twice as long as the old system", and then they're going to think for five seconds, copy-paste the appropriate paragraph, double the memory requirements in the new system's paperwork, and call it a day.

Checklists work.
