Cloudflare outage on December 5, 2025
1. mixedb+Xv1 2025-12-05 22:54:13
>>meetpa+(OP)
This is an architectural problem. The Lua bug, the longer global outage last week, and the long list of earlier incidents like them only expose the problem with the architecture underneath. The original distributed, decentralized web architecture, with heterogeneous endpoints managed by a myriad of organisations, is much more resistant to this kind of global outage. Homogeneous systems like Cloudflare will continue to cause global outages. Rust won't help: people will always make mistakes, in Rust too. A robust architecture addresses this by not allowing a single mistake to bring down a myriad of unrelated services at once.
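
To make that concrete, here's a toy Monte Carlo sketch (every number in it is invented for illustration; none of this is real Cloudflare or web data):

    import random

    N_SERVICES = 1000   # hypothetical number of unrelated services
    N_HOURS = 10_000    # hours simulated
    P_FAIL = 0.001      # made-up per-hour outage probability

    def decentralized_hour():
        # Heterogeneous world: each service fails independently this hour.
        return sum(random.random() < P_FAIL for _ in range(N_SERVICES))

    def centralized_hour():
        # Homogeneous world: one shared provider, so it's all or nothing.
        return N_SERVICES if random.random() < P_FAIL else 0

    for name, sample in [("decentralized", decentralized_hour),
                         ("centralized", centralized_hour)]:
        downs = [sample() for _ in range(N_HOURS)]
        print(f"{name:>13}: avg services down per hour = "
              f"{sum(downs) / N_HOURS:.2f}, worst hour = {max(downs)}")

Both worlds lose roughly the same number of service-hours on average, but only the centralized run ever prints a worst hour in which all 1000 services are down at once. That correlation is the architectural problem.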
2. WD-42+qw1 2025-12-05 22:57:38
>>mixedb+Xv1
In other words, the consolidation onto Cloudflare and AWS makes the web less stable. I agree.
3. amazin+5z1 2025-12-05 23:16:11
>>WD-42+qw1
Usually I am allergic to pithy, vaguely dogmatic summaries like this, but you're right. We have traded "some sites are down some of the time" for "most sites are down some of the time". Sure, the "some" is eliding an order of magnitude or two, but the framing remains directionally correct.
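
To put rough numbers on that elision (both uptime figures below are assumptions for illustration, not measured values for any provider):

    HOURS_PER_YEAR = 24 * 365
    self_hosted = 0.995    # assumed uptime of a lone, self-run site
    big_provider = 0.9999  # assumed uptime of a site behind a large provider

    print(f"self-hosted:  {(1 - self_hosted) * HOURS_PER_YEAR:.1f} h/yr down")
    print(f"big provider: {(1 - big_provider) * HOURS_PER_YEAR:.1f} h/yr down")

Roughly 44 hours a year down versus under one: that's the order of magnitude or two the "some" is hiding. The catch is that the remaining hour now hits most of the web in the same instant instead of being scattered across millions of uncorrelated incidents.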
4. PullJo+VA1 2025-12-05 23:27:05
>>amazin+5z1
Does relying on larger players result in better overall uptime for smaller players? AWS provides me better uptime than anything I could assemble myself, because I am less resourced and less talented than that massive team.

If so, is it a good or bad trade to have more overall uptime when, on the occasions things do go down, everything goes down together?
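
One way to frame it (a sketch with made-up probabilities, nothing provider-specific): compare the chance that at least half of all sites are down in the same hour.

    from math import comb

    N = 50             # hypothetical number of sites
    p_self = 0.005     # assumed per-hour outage probability, self-hosted
    p_shared = 0.0005  # assumed per-hour outage probability of a shared provider

    # Independent world: at least N/2 simultaneous outages, from the
    # exact binomial tail.
    p_half_down_indep = sum(comb(N, k) * p_self**k * (1 - p_self)**(N - k)
                            for k in range(N // 2, N + 1))
    # Shared-provider world: all N sites go down together whenever it does.
    p_half_down_shared = p_shared

    print(f"independent: {p_half_down_indep:.1e}")  # astronomically small
    print(f"shared:      {p_half_down_shared:.1e}") # ~once every 2000 hours

Under these numbers each site gets ten times less downtime on average in the shared world, but "half of everything down at once" goes from effectively never to once every few thousand hours. Whether that's a good trade depends on whether your failures need to stay uncorrelated with everyone else's.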

5. Lampre+fJ1 2025-12-06 00:35:33
>>PullJo+VA1
When only one thing goes down, it's easier to compensate with something else, even for people who are doing critical work but can't fix IT problems themselves. It means the non-technical workforce can figure out ways to keep working, even if the organization doesn't have on-site IT.

Also, if you need to switch over to backup systems for everything at once, then either the backup has to be the same for everything and very easily deployable remotely - which seems unlikely for specialty systems, like hospital systems, or for the old tech that so many organizations still rely on (remember the CrowdStrike BSODs that had to be fixed individually, in person, and so took forever?) - or you're going to need a LOT of well-trained IT people, paid to be on standby constantly, if you want the problems fixed quickly, since they can't be everywhere at once.

If the problems are more spread out over time, then you don't need to have quite so many IT people constantly on standby. Saves a lot of $$$, I'd think.

And if problems are smaller and more spread out over time, an organization can learn to deal with them as a matter of routine, instead of gradually coming to feel and behave as though the problem will never actually happen. And if they DO fuck up their preparedness/response, the consequences are likely less severe.
