zlacker

[return to "Cloudflare outage on December 5, 2025"]
1. paradi+q5[view] [source] 2025-12-05 15:56:37
>>meetpa+(OP)
The deployment pattern from Cloudflare looks insane to me.

I've worked at one of the top fintech firms, whenever we do a config change or deployment, we are supposed to have rollback plan ready and monitor key dashboards for 15-30 minutes.

The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.

I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.

For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.

◧◩
2. dehrma+Tv[view] [source] 2025-12-05 17:46:10
>>paradi+q5
Cloudflare is orders of magnitude larger than any fintech. Rollouts likely take much longer, and having a human monitoring a dashboard doesn't scale.
◧◩◪
3. notepa+XA[view] [source] 2025-12-05 18:08:08
>>dehrma+Tv
That means they engineered their systems incorrectly then? Precisely because they are much bigger, they should be more resilient. You know who's bigger than Cloudflare? tier-1 ISPs, if they had an outage the whole internet would know about it, and they do have outages except they don't cascade into a global mess like this.

Just speculating based on my experience: It's more likely than not that they likely refused to invest in fail-safe architectures for cost reasons. Control-plane and data-plane should be separate, a react patch shouldn't affect traffic forwarding.

Forget manual rollbacks, there should be automated reversion to a known working state.

[go to top]