They saw errors related to a deployment and, because it was related to a security issue, they decided to make another deployment with global blast radius instead of rolling it back?
Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
Pure speculation, but to me it sounds like there's more to the story. This is the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.
From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)
Also, there seems to have been insufficient testing before deployment, with very junior-level mistakes.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
Where was the testing for this one? If ANY exception happened during the rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".
I guess those at Cloudflare are not learning anything from the previous disaster.
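To make the "any exception should fail the deployment and roll it back" idea concrete, here is a minimal Python sketch. Everything in it is hypothetical (`deploy_with_rollback`, `apply`, `health_checks`); it is not taken from Cloudflare's tooling and only illustrates the shape of such a gate.

```python
from typing import Callable, Dict, List

Config = Dict[str, object]


def deploy_with_rollback(
    current: Config,
    candidate: Config,
    apply: Callable[[Config], None],
    health_checks: List[Callable[[], None]],
) -> Config:
    """Apply `candidate`; if applying it or any check raises, re-apply `current`.

    Returns whichever config is live afterwards.
    """
    try:
        apply(candidate)
        for check in health_checks:
            check()  # any exception here counts as a failed deployment
    except Exception:
        apply(current)  # roll back to the last known-good config
        return current
    return candidate
```

The sketch assumes that re-applying the previous config is itself safe, which is exactly the property the rest of this thread argues about.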
In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.
That isn't to say it didn't work out badly this time, just that the calculation is a bit different.
I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.
“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.
“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”
During an incident, the incident lead should be able to say to your team's on-call: "Can you roll back? If so, roll back," and the on-call engineer should know whether it's okay. By default it should be, if you're writing code mindfully.
Certain well-understood migrations are the only cases where roll back might not be acceptable.
Always keep your services in "roll back able", "graceful fail", "fail open" state.
This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.
Never make code changes you can't roll back from without reason and without informing the team. This applies to service calls, data write formats, etc.
I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.
However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems"; it's "has Cloudflare learned from its recent mistakes to respond appropriately to events?"
I won't say never, but a situation where the right way to avoid a rollback (which sounds like it was technically fine to do, just undesirable from a security/business perspective) is a parallel deployment through a radioactive, global-blast-radius, near-instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit.
Ouch. Harsh, given that Cloudflare is being over-honest (down to the detail about disabling the internal tool) and the outage's impact was relatively limited (in duration and in number of customers affected). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap; Dec 5 it's Lua's turn, with its dynamic typing.
Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...
cf TFA:
  if rule_result.action == "execute" then
    rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
  end
This code expects that, if the ruleset has action="execute", the "rule_result.execute" object will exist ... error in the [Lua] code, which had existed undetected for many years ... prevented by languages with strong type systems. In our replacement [FL2 proxy] ... code written in Rust ... the error did not occur.
[0] >>44159166
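For illustration, here is a rough Python analogue of the quoted Lua snippet with the missing guard added. This is only a sketch: the function name `attach_execute_results` and the dict-based data shapes are assumptions, not Cloudflare's code, and indexing is 0-based here whereas Lua tables are 1-based.

```python
from typing import Any, Dict, List, Optional


def attach_execute_results(
    rule_result: Dict[str, Any], ruleset_results: List[Any]
) -> Optional[Any]:
    """Attach sub-ruleset results to an "execute" rule result, if possible.

    Returns the attached results, or None when the action is not "execute"
    or the "execute" object is missing.
    """
    if rule_result.get("action") != "execute":
        return None
    execute = rule_result.get("execute")
    if execute is None:
        # The case the quoted Lua code did not handle: action is "execute"
        # but the "execute" object does not exist.
        return None
    index = int(execute["results_index"])  # mirrors tonumber(...) in the Lua
    execute["results"] = ruleset_results[index]
    return execute["results"]
```

A dynamically typed language will happily run the unguarded version until the missing object actually shows up, which is the point TFA makes about stronger type systems.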
Privately disclosed: Nov 29; fix pushed: Dec 1; publicly disclosed: Dec 3.
Cloudflare made it less of an expedite.
It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.
There’s no other deployment system available. There’s a single system for config deployment, and it’s all that was available, as they haven’t done the progressive rollout implementation yet.
Note that the two deployments were of different components.
Basically, imagine the following scenario: A patch for a critical vulnerability gets released, during rollout you get a few reports of it causing the screensaver to show a corrupt video buffer instead, you roll out a GPO to use a blank screensaver instead of the intended corporate branding, a crash in a script parsing the GPOs on this new value prevents users from logging in.
There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.
Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.
With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough, you've got enough developers that half a dozen PRs will have been merged between the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issues?
Realistically, the best you're going to get is merging a revert of the problematic changeset, but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.
That's to say, it's an incredibly good idea when you can physically implement it. It's not something that everybody can do.
In this case they got unlucky with an incident before they finished work on planned changes from the last incident.
I’m happy to see they’re changing their systems to fail open which is one of the things I mentioned in the conversation about their last outage.
Hindsight is always 20/20, but I don't know how that sort of oversight could happen in an organization whose business model rides on reliability. Small shops understand the importance of safeguards such as progressive deployments or one-box-style deployments with a baking period, so why not the likes of Cloudflare? Don't they have anyone on their payroll who warns about the risks of global deployments without safeguards?
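For readers unfamiliar with the term, a progressive (one-box first) deployment with a baking period can be sketched roughly as below. This is a hypothetical Python illustration: `progressive_rollout` and the `deploy`, `healthy`, and `rollback` callbacks are made-up names, and a real system would watch metrics during the bake rather than just sleep.

```python
import time
from typing import Callable, List, Sequence


def progressive_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    bake_seconds: float = 0.0,
) -> bool:
    """Deploy stage by stage; undo everything if a stage is unhealthy after baking.

    Returns True if every stage was deployed and stayed healthy.
    """
    done: List[str] = []
    for stage in stages:
        for host in stage:
            deploy(host)
            done.append(host)
        time.sleep(bake_seconds)  # let the change bake before widening it
        if not all(healthy(host) for host in stage):
            for host in reversed(done):
                rollback(host)  # later stages are never touched
            return False
    return True
```

With stages like `[["canary"], ["region-a"], ["everything-else"]]`, a bad change stops at the first unhealthy stage instead of reaching the whole fleet at once.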
There is another name for rolling forward: it's called tripping up.
This is specious reasoning. How come I had to endure a total outage due to the rollout of a mitigation for a Next.js vulnerability when my organization doesn't even own any React app, let alone a Next.js one?
Also specious reasoning #2, not wanting to maintain a service does not justify blindly rolling out config changes globally without any safeguards.
The short answer is "yes" due to the way the configuration management works. Other infrastructure changes or service upgrades might get undone, but it's possible. Otherwise, revert the commit that introduced the package bump with the new code and force that to roll out everywhere rather than waiting for the progressive rollout.
There shouldn't be much chance of bringing the system to a novel state because configuration management will largely put things into the correct state. (Where that doesn't work is if CM previously created files, it won't delete them unless explicitly told to do so.)
As a recovering devops/infra person from a lifetime ago (who has, much to my heartbreak, broken prod more than once), perhaps that is where my grace in this regard comes from. Systems and their components break, systems and processes are imperfect, and urgency can lead to unexpected failure. Sometimes it's Cloudflare, other times it's Azure, GCP, GitHub, etc. You can always use something else, but most of us continue to pick the happy path of "it works most of the time, and sometimes it does not." Hopefully the post mortem has action items to improve the safeguards you mention. If there are no process and technical improvements from the outage, certainly, that is where the failure lies (imho).
China-nexus cyber threat groups rapidly exploit React2Shell vulnerability (CVE-2025-55182) - https://aws.amazon.com/blogs/security/china-nexus-cyber-thre... - December 4th, 2025
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
This can be architected in such a way that if one rules engine crashes, other systems are not impacted and other rules, cached rules, heuristics, global policies, etc. continue to function and provide shielding.
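A rough Python sketch of that kind of isolation, under heavy assumptions: `evaluate`, the engine names, and the "engine returns True to block" convention are all invented for illustration and say nothing about how Cloudflare's proxy is actually structured.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]
Engine = Callable[[Request], bool]  # returns True to block the request


def evaluate(request: Request, engines: Dict[str, Engine]) -> Tuple[bool, List[str]]:
    """Run every engine independently; a crash in one does not stop the others.

    Returns (blocked, names_of_engines_that_crashed).
    """
    blocked = False
    crashed: List[str] = []
    for name, engine in engines.items():
        try:
            if engine(request):
                blocked = True
        except Exception:
            crashed.append(name)  # record it and continue with the remaining engines
    return blocked, crashed
```

Note that this sketch treats a crashed engine as "no verdict"; whether a crash should instead count as a block is a policy decision the caller can make from the returned list of crashed engines.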
You can't ask for Cloudflare to turn on a dime and implement this in this manner. Their infra is probably very sensibly architected by great engineers. But there are always holes, especially when moving fast, migrating systems, etc. And there's probably room for more resiliency.
Particularly if we're asking them to be careful and deliberate about deployments, it's hard to also ask them to fast-track this.
But who knows what issues reverting other teams' changes might bring?
I think your take is terribly simplistic. In a professional setting, virtually all engineers have no say on whether the company switches platforms or providers. Their responsibility is to maintain and develop services that support business. The call to switch a provider is ultimately a business and strategic call, and is a subject that has extremely high inertia. You hired people specialized in technologies, and now you're just dumping all that investment? Not to mention contracts. Think about the problem this creates.
Some of you sound like amateurs toying with pet projects, where today it's framework A on cloud provider X whereas tomorrow it's framework B on cloud provider Y. Come the next day, rinse and repeat. This is unthinkable in any remotely professional setting.
And on top of that, Cloudflare's value proposition is "we're smart enough to know that instantaneous global deployments are a bad idea, so trust us to manage services for you so you don't have to rely on in-house folks who might not know better."
Vendor contracts have 1-3 year terms. We (a financial services firm) re-evaluate tech vendors every year for potential replacement and technologists have direct input into these processes. I understand others may operate under a different vendor strategy. As a vendor customer, your choices are to remain a customer or to leave and find another vendor. These are not feelings, these are facts. If you are unhappy but choose not to leave a vendor, that is a choice, but it is your choice to make, and unless you are a large enough customer that you have leverage over the vendor, these are your only options.
Disclosure: I work at Cloudflare, but not on the WAF