zlacker

[return to "Cloudflare outage on December 5, 2025"]
1. w10-1+aw 2025-12-05 17:47:25
>>meetpa+(OP)
Kudos to Cloudflare for clarity and diligence.

When talking of their earlier Lua code:

> we have never before applied a killswitch to a rule with an action of “execute”.

I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?

It tracks with what I've seen elsewhere: quality engineering can't keep up with production engineering. It's just that I think of Cloudflare as an infrastructure place, where that shouldn't be true.

I had a manager who came from defense electronics in the 1980s. He said in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.

◧◩
2. braiam+qD 2025-12-05 18:20:24
>>w10-1+aw
This is funny, considering that someone who worked in the defense industry (on a guided missile system) found a memory leak in one of their products at that time. They told him that they knew about it, but that it was timed just right for the range over which the system would be used, so it didn't matter.
◧◩◪
3. mopsi+9K 2025-12-05 18:49:22
>>braiam+qD
... until the extended-range version is ordered and no one remembers to fix the leak. :]
◧◩◪◨
4. wizzwi+TU 2025-12-05 19:33:49
>>mopsi+9K
They will remember, because it'll have been measured and documented, rigorously.
◧◩◪◨⬒
5. Sketch+IW 2025-12-05 19:43:40
>>wizzwi+TU
I've found that the real trick with documentation isn't creation, it's discovery. I wonder how easily that information can be found afterwards.
◧◩◪◨⬒⬓
6. lloeki+dY 2025-12-05 19:50:59
>>Sketch+IW
By reading the documentation thoroughly as a compulsory first step to designing the next system that depends on it.

I realise this may boggle the mind of the modern software developer.

◧◩◪◨⬒⬓⬔
7. switch+R61 2025-12-05 20:36:59
>>lloeki+dY
Just try harder. And if it still breaks, clearly you weren't trying hard enough!

At some point you have to admit that humans are pretty bad at some things. Keeping documentation up to date and coherent is one of those things, especially in the age of TikTok.

Better to live in the world we have and do the best you can, than to endlessly argue about how things should be but never will become.

◧◩◪◨⬒⬓⬔⧯
8. vimwiz+Ha1 2025-12-05 20:54:55
>>switch+R61
> especially in the age of TikTok

Shouldn't grey beards, grizzled by years of practicing rigorous engineering, be passing this knowledge on to the next generation? How did they learn it when just starting out? They weren't born with it. Maybe engineering has actually improved so much that we only need to experience outages this frequently, and such feelings of nostalgia are born from never having to deal with systems having such high degrees of complexity and, realistically, 100% availability expectations on a global scale.

◧◩◪◨⬒⬓⬔⧯▣
9. switch+IB1 2025-12-05 23:33:28
>>vimwiz+Ha1
We were talking about making a missile (v2) with an extended range, and ensuring that the developers who work on it understand the assumption of the prior model: that it never calls free() because it's expected to blow up before that would become an issue (a perfectly valid approach, I might add). And ensuring that this assumption still holds in the v2 extended-range model. The analogy to Ariane 5 is very apt.
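
To make the point concrete, here is a minimal sketch of what checking that assumption could look like if it lived next to the code instead of only in a document. Everything in it is hypothetical: the `LeakBudget` name, its fields, and all the numbers are made up for illustration and describe no real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeakBudget:
    """The documented assumption: memory leaked in flight stays within the headroom."""

    leak_bytes_per_second: float  # measured leak rate (hypothetical figure)
    max_flight_seconds: float     # longest possible flight for this variant
    available_bytes: float        # memory headroom the leak is allowed to consume

    def leaked_at_end_of_flight(self) -> float:
        # Worst case: the leak runs for the entire flight.
        return self.leak_bytes_per_second * self.max_flight_seconds

    def holds(self) -> bool:
        # The "never free" design is only valid while this is true.
        return self.leaked_at_end_of_flight() <= self.available_bytes


# Illustrative numbers only; none of these come from a real system.
v1 = LeakBudget(leak_bytes_per_second=2_000, max_flight_seconds=120, available_bytes=512_000)
v2 = LeakBudget(leak_bytes_per_second=2_000, max_flight_seconds=300, available_bytes=512_000)

assert v1.holds()      # 240,000 bytes leaked, inside the 512,000-byte headroom
assert not v2.holds()  # 600,000 bytes leaked, the v1 assumption breaks at extended range
```

A check like this only helps if someone reruns it with the v2 flight time, which is the same discovery problem in a different place.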

Now, there can be tens of thousands of similar considerations to document. And keeping that documentation in step with the actual state of the world is a full-time job in itself.

You can argue all you want that folks "should" do this or that, but all I've seen in my entire career is that documentation is almost universally out of date, and not worth relying on because it actively steers you in the wrong direction. And I actually disagree (as someone with some gray in my beard) with your premise that this is part of "rigorous engineering" as it is practiced today. I wish it were, but the reality is you have to read the code, read it again, see what it does on your desk, see what it does in the wild, and still not trust it.

We "should" be nice to each other, I "should" make more money, and it "should" be sunny more often. And we "should" have well-written, accurate and reliable docs, but I'm too old to be waiting around for that day to come, especially in the age of zero attention and AI-generated shite.
