zlacker

[return to "Cloudflare outage on December 5, 2025"]
1. flamin+q3 2025-12-05 15:49:27
>>meetpa+(OP)
What's the culture like at Cloudflare re: ops/deployment safety?

They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?

Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back", but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Pure speculation, but to me that sounds like there's more to the story. This sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.

◧◩
2. deadba+X3 2025-12-05 15:51:32
>>flamin+q3
As usual, Cloudflare is the man in the arena.
◧◩◪
3. samrus+i6 2025-12-05 15:59:44
>>deadba+X3
There are other men in the arena who aren't tripping on their own feet.
◧◩◪◨
4. usrnm+S7 2025-12-05 16:05:54
>>samrus+i6
Like who? Which large tech company doesn't have outages?
◧◩◪◨⬒
5. k8sToG+h9 2025-12-05 16:10:40
>>usrnm+S7
It's not about outages. It's about the why. Hardware can fail. Bugs can happen. But to continue a rollout despite warning signs and without understanding the cause and impact is on another level. Especially if it is related to the same problem as last time.
◧◩◪◨⬒⬓
6. udev40+kk 2025-12-05 16:56:14
>>k8sToG+h9
And yet, it's always clownflare breaking everything. Failures are inevitable, which is widely known; therefore we build resilient systems to overcome the inevitable.