Cloudflare outage on December 5, 2025

>>meetpa+(OP)
What I'm missing here is a test environment. Gradual or not; why are they deploying straight to prod? At Cloudflare's scale, there should be a dedicated room in Cloudflare HQ with a full isolated model-scale deployment of their entire system. All changes should go there first, with tests run for every possible scenario.

Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you, good procedure will.

>>uyzstv+6m
I am sure they have this. What tends to happen is that the gradual rollout system becomes too slow for some rare, low latency rollout requirements, so a config system is introduced that fulfills the requirements. For example, let’s say you have a gradual rollout for binaries (slow) and configuration (fast). Over time, the fast rollout of the configuration system will cause outages, so it’s slowed down. Then a requirement pops up for which the config system is too slow and someone identifies a global system with no gradual rollout (e.g. a database) to be used as the solution. That solution will be compliant with all the processes that have been introduced to the letter, because so far nobody has thought of using a single database row for global configuration yet. Add new processes whenever this happens and at some point everything will be too slow and taking on more risk becomes necessary to stay competitive. So processes are adjusted. Repeat forever.

zlacker