On their earlier Lua code:
> we have never before applied a killswitch to a rule with an action of “execute”.
I was surprised that a rules-based system was not completely tested; perhaps because the Lua code is legacy relative to the newer Rust implementation?
It tracks with what I've seen elsewhere: quality engineering can't keep up with production engineering. It's just that I think of Cloudflare as an infrastructure place, where that shouldn't be true.
I had a manager who came from defense electronics in the 1980s. He said that in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.
I realise this will probably boggle the mind of the modern software developer.
If there is a memory leak, then this is a flaw. It might not matter so much for a specific product, but I can easily see it being forgotten: maybe it was mentioned somewhere in the documentation, but not clearly enough, and deadlines and the stress to ship are a thing there as well.
Well obviously not, because the front fell off. That’s a dead giveaway.
I won’t remember this block of code, because five other people have touched it. So I need to be able to see what has changed and what it talks to, so I can quickly verify whether my old assumptions still hold true.
Most of the time QA can tell you exactly how the product works, regardless of what the documentation says. But many of us haven’t seen a QA team in five, ten years.
Every company that has ignored my following advice has experienced a day-for-day slip in first-quarter scheduling. And that advice is: not much work gets done between Dec 15 and Jan 15. You can rely on a week’s worth; more than that is optimistic. People are taking it easy, and they need to verify things with someone who is on vacation, so they are blocked. And when that person gets back, it’s two days until their own vacation, so it’s a crap shoot.
NB: there’s work happening on Jan 10, for certain, but it’s not getting finished until the 15th. People are often still cleaning up after bad decisions they made during the holidays and the subsequent hangover.
At some point you have to admit that humans are pretty bad at some things. Keeping documentation up to date and coherent is one of those things, especially in the age of TikTok.
Better to live in the world we have and do the best you can than to endlessly argue about how things should be but never will be.
They're going to see "oh, it leaks 3MiB per minute… and this system runs for twice as long as the old system", and then they're going to think for five seconds, copy-paste the appropriate paragraph, double the memory requirements in the new system's paperwork, and call it a day.
Checklists work.
Shouldn't grey beards, grizzled by years of practicing rigorous engineering, be passing this knowledge on to the next generation? How did they learn it when just starting out? They weren't born with it. Maybe engineering has actually improved so much that outages now happen only this often, and such feelings of nostalgia are born from never having had to deal with systems of such high complexity and, realistically, 100% availability expectations on a global scale.
The amount of dedication and meticulous, concentrated work that I saw from older engineers when I started working, and that I remember from my grandfathers, is something I very rarely observe these days, whether in engineering-specific fields or in general.
What works much better is having an intentional review step that you come back to.
Military hardware is produced with engineering design practices that look nothing at all like what most of the HN crowd is used to. There is an extraordinary amount of documentation, requirements, and validation done for everything.
There is a MIL-SPEC for pop tarts which defines all part sizes, tolerances, etc.
Unlike a lot of the software world, military hardware gets DONE with design, and then they just manufacture it.
It's never right to leave structural issues in place, even if "they don't happen under normal conditions".
Now, there can be tens of thousands of similar considerations to document. And keeping that documentation in sync with the actual state of the world is a full-time job in itself.
You can argue all you want that folks "should" do this or that, but all I've seen in my entire career is that documentation is almost universally out of date, and not worth relying on because it's actively steering you in the wrong direction. And I actually disagree (as someone with some gray in my beard) with your premise that this is part of "rigorous engineering" as it is practiced today. I wish it was, but the reality is you have to read the code, read it again, see what it does on your desk, see what it does in the wild, and still not trust it.
We "should" be nice to each other, I "should" make more money, and it "should" be sunny more often. And we "should" have well written, accurate and reliable docs, but I'm too old to be waiting around for that day to come, especially in the age of zero attention and AI generated shite.
If a missile passes the long hurdles and hoops built into modern Defence T&E procurement, it will only ever be considered out of spec once it fails.
A good portion of platforms will go into service, be used for a decade or longer, and not once will the design be modified before going end-of-life and being replaced.
If you wanted to progressively iterate on or improve these platforms, then yes, continual updates and investing in the eradication of tech debt would be well worth the cost.
If you're strapping explosives attached to a rocket engine onto your vehicle and pointing it at someone, there is merit in knowing it will behave exactly the same way it has the past 1000 times.
Neither ethos of modifying a system is necessarily wrong, but you do have to choose which one you're going with, and know the merits and drawbacks of that choice.
It might be more maintainable to have leaks instead of elaborate destruction routines, because then you only have to consider the costs of allocations.
Java has a no-op garbage collector (Epsilon GC) for the same reason. If your financial application really needs good performance at any cost and you don't want to rewrite it, you can throw money (that is, more RAM) at the problem to make it go away.
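To make the leak-instead-of-teardown trade-off concrete, here is a minimal Rust sketch; the `app.conf` path and the `load_config` helper are made-up illustrations, not anyone's actual code.

```rust
/// Load configuration that lives for the entire process. Instead of
/// writing (and maintaining) an elaborate destruction routine, we leak
/// the allocation on purpose: the OS reclaims all memory at process
/// exit anyway, and we get a `'static` reference with no `Drop` logic
/// to reason about on the shutdown path.
fn load_config() -> &'static str {
    let owned = std::fs::read_to_string("app.conf")
        .unwrap_or_else(|_| String::from("default config"));
    Box::leak(owned.into_boxed_str())
}

fn main() {
    let cfg: &'static str = load_config();
    println!("running with: {cfg}");
    // No destructor ever runs for `cfg`; the only cost we had to
    // consider was the one-time allocation, as the comment above says.
}
```

The memory is "leaked" only in the narrow sense that it is never freed before exit; for data with process lifetime, that is the whole point.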
Canary deployment, testing environments, unit tests, integration tests, anything really?
It sounds like they test by merging directly to production, but surely they don't?
A key part of secure systems is availability...
It really looks like vibe-coding.
It's still a bit silly though: their claimed reasoning probably doesn't really stack up for most of their config changes. I don't see it as that likely that a 0.1% -> 1% -> 10% -> 100% rollout over the course of 10 minutes would be a catastrophically bad idea for them for _most_ changes (a sketch of such a staged gate is below).
And to their credit, it does seem they want to change that.
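For what it's worth, the gating logic for a 0.1 -> 1 -> 10 -> 100 staged rollout can be tiny. Here is a minimal Rust sketch under stated assumptions: the `STAGES` constant, the `in_rollout` function, and bucketing by a user id are all illustrative inventions, not Cloudflare's actual mechanism.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stages of a progressive rollout, in percent of traffic.
const STAGES: [f64; 4] = [0.1, 1.0, 10.0, 100.0];

/// Deterministically decide whether `id` falls inside the current
/// rollout percentage. Hashing the id (rather than rolling dice per
/// request) keeps any one user's experience stable across requests.
fn in_rollout(id: &str, percent: f64) -> bool {
    let mut hasher = DefaultHasher::new();
    id.hash(&mut hasher);
    // Map the hash onto [0, 100) with 0.01% granularity and compare.
    let bucket = (hasher.finish() % 10_000) as f64 / 100.0;
    bucket < percent
}

fn main() {
    for (stage, pct) in STAGES.iter().enumerate() {
        let served = (0..100_000)
            .filter(|n| in_rollout(&format!("user-{n}"), *pct))
            .count();
        println!("stage {stage}: {pct}% of traffic -> {served} of 100000 users");
    }
}
```

The hard part isn't this gate; it's holding at each stage long enough to watch error rates before widening, which is where the 10 minutes would go.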