zlacker

Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation of course, but sometimes “roll forward” is the better solution.

replies(2): >>echelo+U7 >>flamin+mb

>>lukeas+(OP)
You want to build a world where roll back is the right thing to do 95% of the time, so that it almost always works and you don't even have to think about it.

During an incident, the incident lead should be able to say to your team's on-call: "can you roll back? If so, roll back" and the on-call engineer should know if it's okay. By default it should be, if you're writing code mindfully.

Certain well-understood migrations are the only cases where roll back might not be acceptable.

Always keep your services in a "roll back able", "graceful fail", "fail open" state.

This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.

Never make changes you can't roll back from (service calls, data write formats, etc.) without a good reason and without informing the team.
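To make the data write format case concrete, here is a minimal sketch of one common way to keep a format change roll-back-safe: ship a reader that understands both formats first, and only start writing the new format later. The record shape, the `v` version tag, and the "USD" default are all hypothetical and only there for illustration.

```python
import json


def encode_v1(amount_cents: int) -> str:
    # Old write format: no version tag, amount only.
    return json.dumps({"amount": amount_cents})


def encode_v2(amount_cents: int, currency: str) -> str:
    # New write format: adds a version tag and a currency field.
    return json.dumps({"v": 2, "amount": amount_cents, "currency": currency})


def decode(raw: str) -> dict:
    # Reader that accepts both formats. Deploy this before any writer
    # emits v2, so rolling the writer back never strands unreadable data.
    record = json.loads(raw)
    version = record.get("v", 1)
    if version == 1:
        # Assumption for this sketch: legacy records were all in USD.
        return {"amount": record["amount"], "currency": "USD"}
    if version == 2:
        return {"amount": record["amount"], "currency": record["currency"]}
    raise ValueError(f"unknown record version: {version}")
```

The point of the ordering is that the release you might roll back to can still read whatever the newer release wrote in the meantime.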

I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.

replies(1): >>drysar+Ol

>>lukeas+(OP)
Like the other poster said, roll back should be the right answer the vast majority of the time. But it's also important to recognize that roll forward should be a replacement for the deployment you decided not to roll back, not a parallel deployment through another system.

I won't say never, but the rollback here sounds like it was technically fine to do, just undesirable from a security/business perspective. A situation where the right way to avoid that rollback is a parallel deployment through a radioactive, global-blast-radius, near-instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit.

replies(1): >>crote+2p

>>echelo+U7
"Fail open" state would have been improper here, as the system being impacted was a security-critical system: firewall rules.

It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.

replies(1): >>echelo+cU

>>flamin+mb
Is a roll back even possible at Cloudflare's size?

With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough, you've got enough developers that half a dozen PRs will have been merged between the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issue?

Realistically the best you're going to get is merging a revert of the problematic changeset - but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.

replies(4): >>newsof+1q >>yuliyp+xr >>gabrie+AM >>jameso+NO

>>crote+2p
If companies like Cloudflare haven't figured out how to do reliable rollbacks, there seems little hope for any of us.

>>crote+2p
I'd presume they have the ability to deploy a previous artifact vs only tip-of-master.

>>crote+2p
That will depend on how you structure your deployments. At some large tech companies, thousands of little changes are made every hour, while deployments are made in n-day cycles. A cut-off point in time is chosen, and the first 'green' commit after that is picked for the current deployment. If that fails in an unexpected way, you just deploy the last binary back, fix (and test) whatever broke, and either try again or just abandon the release if the next cut is already close by.
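A rough sketch of that cut selection, in case it helps. Everything here (the `Commit` fields, the function names, the list of deployed artifacts) is made up for illustration and is not any particular company's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    sha: str
    timestamp: int  # seconds since epoch
    green: bool     # True if CI passed on this commit


def pick_release_commit(commits: list[Commit], cutoff: int) -> Optional[Commit]:
    """Return the first green commit at or after the cutoff, if any."""
    for commit in sorted(commits, key=lambda c: c.timestamp):
        if commit.timestamp >= cutoff and commit.green:
            return commit
    return None


def rollback_target(deployed: list[str]) -> Optional[str]:
    """Return the artifact that was live before the current one, if any.

    `deployed` is the deployment history, oldest first.
    """
    return deployed[-2] if len(deployed) >= 2 else None
```

Rolling back here means redeploying an already-built artifact, so whatever has been merged since the cut doesn't come into play.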

>>crote+2p
Disclosure: Former Cloudflare SRE.

The short answer is "yes" due to the way the configuration management works. Other infrastructure changes or service upgrades might get undone, but it's possible. Alternatively, you can revert the commit that introduced the package bump with the new code and force that to roll out everywhere rather than waiting for the progressive rollout.

There shouldn't be much chance of bringing the system to a novel state because configuration management will largely put things into the correct state. (Where that doesn't work is if CM previously created files, it won't delete them unless explicitly told to do so.)
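A toy model of that caveat, assuming a hypothetical `converge` step that treats file state as a plain dict of path to contents. This is not how any real CM tool is implemented, just the shape of the problem.

```python
def converge(
    actual: dict[str, str],
    desired: dict[str, str],
    absent: frozenset[str] = frozenset(),
) -> dict[str, str]:
    """Return the file state after one configuration-management run."""
    result = dict(actual)
    result.update(desired)   # create or overwrite every managed file
    for path in absent:      # delete only what is explicitly listed
        result.pop(path, None)
    return result


# Hypothetical paths: the new release added extra.conf, the old one never had it.
old_release = {"/etc/app/a.conf": "v1"}
new_release = {"/etc/app/a.conf": "v2", "/etc/app/extra.conf": "v2"}

state = converge({}, new_release)
rolled_back = converge(state, old_release)
cleaned = converge(state, old_release, absent=frozenset({"/etc/app/extra.conf"}))
```

In this model `rolled_back` still contains `extra.conf`, so the host is not quite in the old state, while `cleaned` matches `old_release` exactly because the removal was spelled out.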

replies(1): >>mewpme+vd1

>>drysar+Ol
Cloudflare is supposed to protect me from the occasional DDoS, not take my business offline entirely.

This can be architected in such a way that if one rules engine crashes, other systems are not impacted and other rules, cached rules, heuristics, global policies, etc. continue to function and provide shielding.
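As a minimal sketch of the "cached rules keep working" part: a failed rules push leaves the last known-good rule set in place instead of taking evaluation down. The `RulesEngine` class and its loader are hypothetical, and the choice to block everything before any rules have loaded is an assumption of this sketch, not a claim about what Cloudflare does.

```python
from typing import Callable, Optional

Rule = Callable[[dict], bool]  # returns True if the request should be blocked


class RulesEngine:
    def __init__(self, loader: Callable[[], list[Rule]]):
        self._loader = loader
        self._last_good: Optional[list[Rule]] = None

    def refresh(self) -> bool:
        """Try to load new rules; keep the last good set if loading fails."""
        try:
            rules = self._loader()
        except Exception:
            return False  # bad push: keep serving with the cached rules
        self._last_good = rules
        return True

    def should_block(self, request: dict) -> bool:
        if self._last_good is None:
            # Nothing has ever loaded. This sketch fails closed here; whether
            # to fail open or closed is a policy decision.
            return True
        return any(rule(request) for rule in self._last_good)
```

This is neither fail open nor fail closed in the steady state: a broken update degrades to slightly stale protection rather than to no protection or no traffic.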

You can't ask Cloudflare to turn on a dime and implement it this way. Their infra is probably very sensibly architected by great engineers. But there are always holes, especially when moving fast, migrating systems, etc. And there's probably room for more resiliency.

>>jameso+NO
> service upgrades might get undone, but it's possible.

But who knows what issues reverting other teams' stuff might bring?