On their earlier Lua code:
> we have never before applied a killswitch to a rule with an action of “execute”.
I was surprised that a rules-based system was not completely tested; perhaps because the Lua code is legacy relative to the newer Rust implementation?
It tracks with what I've seen elsewhere: quality engineering can't keep up with production engineering. It's just that I think of Cloudflare as an infrastructure place, where that shouldn't be true.
I had a manager who came from defense electronics in the 1980s. He said that in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.
I realise this will probably boggle the mind of the modern software developer.
If there is a memory leak, then this is a flaw. It might not matter so much for a specific product, but I can easily see it being forgotten: maybe it was mentioned somewhere in the documentation, but not clearly enough, and deadlines and the stress to ship are a thing there as well.
Well obviously not, because the front fell off. That’s a dead giveaway.
I won’t remember this block of code, because five other people have touched it. So I need to be able to see what has changed and what it talks to, so I can quickly verify whether my old assumptions still hold true.
Most of the time QA can tell you exactly how the product works, regardless of what the documentation says. But many of us haven’t seen a QA team in five, ten years.
Every company that has ignored my following advice has experienced a day-for-day slip in first-quarter scheduling. And that advice is: not much work gets done between Dec 15 and Jan 15. You can rely on a week’s worth; more than that is optimistic. People are taking it easy, and they need to verify things with someone who is on vacation, so they are blocked. And when that person gets back, it’s two days until their own vacation, so it’s a crap shoot.
NB: there’s work happening on Jan 10, for certain, but it’s not getting finished until the 15th. People are often still cleaning up after bad decisions they made during the holidays and the subsequent hangover.
At some point you have to admit that humans are pretty bad at some things. Keeping documentation up to date and coherent is one of those things, especially in the age of TikTok.
Better to live in the world we have and do the best you can than to endlessly argue about how things should be but never will be.
They're going to see "oh, it leaks 3MiB per minute… and this system runs for twice as long as the old system", and then they're going to think for five seconds, copy-paste the appropriate paragraph, double the memory requirements in the new system's paperwork, and call it a day.
Checklists work.
Shouldn't grey beards, grizzled by years of practicing rigorous engineering, be passing this knowledge on to the next generation? How did they learn it when just starting out? They weren't born with it. Maybe engineering has actually improved so much that outages now happen only this often, and such feelings of nostalgia are born from never having had to deal with systems of such high complexity and, realistically, 100% availability expectations on a global scale.
The amount of dedication and meticulous, concentrated work that I saw from older engineers when I started working, and that I remember from my grandfathers, is something I very rarely observe these days, whether in engineering-specific fields or in general.
What works much better is having an intentional review step that you come back to.
Military hardware is produced with engineering design practices that look nothing at all like what most of the HN crowd is used to. There is an extraordinary amount of documentation, requirements, and validation done for everything.
There is a MIL-SPEC for pop tarts which defines all part sizes, tolerances, etc.
Unlike a lot of the software world, military hardware gets DONE with design, and then they just manufacture it.
It's never right to leave structural issues in place, even if "they don't happen under normal conditions".
Now, there can be tens of thousands of similar considerations to document. And keeping that documentation in sync with the actual state of the world is a full-time job in itself.
You can argue all you want that folks "should" do this or that, but all I've seen in my entire career is that documentation is almost universally out of date, and not worth relying on because it's actively steering you in the wrong direction. And I actually disagree (as someone with some gray in my beard) with your premise that this is part of "rigorous engineering" as it is practiced today. I wish it was, but the reality is you have to read the code, read it again, see what it does on your desk, see what it does in the wild, and still not trust it.
We "should" be nice to each other, I "should" make more money, and it "should" be sunny more often. And we "should" have well written, accurate and reliable docs, but I'm too old to be waiting around for that day to come, especially in the age of zero attention and AI generated shite.
If a missile passes the long hurdles and hoops built into modern Defence T&E procurement, it will only ever be considered out of spec once it fails.
A good portion of platforms will go into service, be used for a decade or longer, and not once will the design be modified before going end-of-life and being replaced.
If you wanted to progressively iterate on or improve these platforms, then yes, continual updates and investing in the eradication of tech debt would be well worth the cost.
If you're strapping explosives attached to a rocket engine onto your vehicle and pointing it at someone, there is merit in knowing it will behave exactly the same way it has the past 1000 times.
Neither ethos of modifying a system is necessarily wrong, but you do have to choose which one you're going with, and know the merits and drawbacks of that choice.
It might be more maintainable to have leaks instead of elaborate destruction routines, because then you only have to consider the costs of allocations.
Java has a no-op garbage collector (Epsilon GC) for the same reason. If your financial application really needs good performance at any cost and you don't want to rewrite it, you can throw money (that is, more RAM) at the problem to make it go away.
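To make the leak-instead-of-teardown trade-off concrete, here is a minimal Rust sketch; the `app.conf` path and the `load_config` helper are made-up illustrations, not anyone's actual code.

```rust
/// Load configuration that lives for the entire process. Instead of
/// writing (and maintaining) an elaborate destruction routine, we leak
/// the allocation on purpose: the OS reclaims all memory at process
/// exit anyway, and we get a `'static` reference with no `Drop` logic
/// to reason about on the shutdown path.
fn load_config() -> &'static str {
    let owned = std::fs::read_to_string("app.conf")
        .unwrap_or_else(|_| String::from("default config"));
    Box::leak(owned.into_boxed_str())
}

fn main() {
    let cfg: &'static str = load_config();
    println!("running with: {cfg}");
    // No destructor ever runs for `cfg`; the only cost we had to
    // consider was the one-time allocation, as the comment above says.
}
```

The memory is "leaked" only in the narrow sense that it is never freed before exit; for data with process lifetime, that is the whole point.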
Canary deployment, testing environments, unit tests, integration tests, anything really?
It sounds like they test by merging directly to production, but surely they don't?
A key part of secure systems is availability...
It really looks like vibe-coding.
It's still a bit silly though: their claimed reasoning probably doesn't really stack up for most of their config changes. I don't see it as that likely that a 0.1% -> 1% -> 10% -> 100% rollout over the course of 10 minutes would be a catastrophically bad idea for them for _most_ changes (a sketch of such a staged gate is below).
And to their credit, it does seem they want to change that.
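For what it's worth, the gating logic for a 0.1 -> 1 -> 10 -> 100 staged rollout can be tiny. Here is a minimal Rust sketch under stated assumptions: the `STAGES` constant, the `in_rollout` function, and bucketing by a user id are all illustrative inventions, not Cloudflare's actual mechanism.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stages of a progressive rollout, in percent of traffic.
const STAGES: [f64; 4] = [0.1, 1.0, 10.0, 100.0];

/// Deterministically decide whether `id` falls inside the current
/// rollout percentage. Hashing the id (rather than rolling dice per
/// request) keeps any one user's experience stable across requests.
fn in_rollout(id: &str, percent: f64) -> bool {
    let mut hasher = DefaultHasher::new();
    id.hash(&mut hasher);
    // Map the hash onto [0, 100) with 0.01% granularity and compare.
    let bucket = (hasher.finish() % 10_000) as f64 / 100.0;
    bucket < percent
}

fn main() {
    for (stage, pct) in STAGES.iter().enumerate() {
        let served = (0..100_000)
            .filter(|n| in_rollout(&format!("user-{n}"), *pct))
            .count();
        println!("stage {stage}: {pct}% of traffic -> {served} of 100000 users");
    }
}
```

The hard part isn't this gate; it's holding at each stage long enough to watch error rates before widening, which is where the 10 minutes would go.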