zlacker

The deployment pattern from Cloudflare looks insane to me.

I've worked at one of the top fintech firms, whenever we do a config change or deployment, we are supposed to have rollback plan ready and monitor key dashboards for 15-30 minutes.

The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.

I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.

For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.

replies(5): >>theide+o7 >>markus+Ga >>dehrma+tq >>vlovic+tr >>nova22+DC

>>paradi+(OP)
Same, my time at a F100 ecommerce retailer showed me the same. Every change control board justification needed an explicit back-out/restoration plan with exact steps to be taken, what was being monitored to ensure that was being held to, contacts of prominent groups anticipated to have an effect, emergency numbers/rooms for quick conferences if in fact something did happen.

The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).

replies(2): >>prdona+aa >>lljk_k+lv

>>theide+o7
And you moved at a glacial pace compared to Cloudflare. There are tradeoffs.

replies(1): >>theide+7p

>>paradi+(OP)
My guess is that CF has so many external customers that they need to move fast and try not to break things. My hunch is that their culture always favors moving fast. As long as they are not breaking too many things, customers won't leave them.

replies(2): >>paradi+jb >>linhns+SE

>>markus+Ga
There is nothing wrong with moving fast and deploying fast.

I'm more talking about how slow it was to detect the issue caused by the config change, and perform the rollback of the config change. It took 20 minutes.

>>prdona+aa
Yes, of course, I want the organization that inserted itself into handling 20% of the world's internet traffic to move fast and break things. Like breaking the internet on a bi-weekly basis. Yep, great tradeoff there.

Give me a break.

replies(3): >>wvenab+gq >>jimmyd+9A >>Jeremy+8E

>>theide+7p
But if your job is mitigate attacks/issues then things can very broken while you're being slow to mitigate it.

>>paradi+(OP)
Cloudflare is orders of magnitude larger than any fintech. Rollouts likely take much longer, and having a human monitoring a dashboard doesn't scale.

replies(3): >>notepa+xv >>cowsan+WG >>paradi+Ra2

>>paradi+(OP)
That is also true at Cloudflare for what it’s worth. However, the company is so big that there’s so many different products all shipping at the same time it can be hard to correlate it to your release, especially since there’s a 5 min lag (if I recall correctly) in the monitoring dashboards to get all the telemetry from thousands of servers worldwide.

Comparing the difficulty of running the world’s internet traffic with hundreds of customer products with your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.

replies(3): >>autoex+xF >>thepla+6c1 >>paradi+lb2

>>theide+o7
This sounds just as bad as yolo-merges, just on the other end of the spectrum.

>>dehrma+tq
That means they engineered their systems incorrectly then? Precisely because they are much bigger, they should be more resilient. You know who's bigger than Cloudflare? tier-1 ISPs, if they had an outage the whole internet would know about it, and they do have outages except they don't cascade into a global mess like this.

Just speculating based on my experience: It's more likely than not that they likely refused to invest in fail-safe architectures for cost reasons. Control-plane and data-plane should be separate, a react patch shouldn't affect traffic forwarding.

Forget manual rollbacks, there should be automated reversion to a known working state.

replies(1): >>vlovic+yI

>>theide+7p
While you're taking your break, exploits gain traction in the wild and one of the value propositions for using a service provider like CloudFlare is catching and mitigating theses exploits as fast as possible. From the OP, this outage was in relation to handling a nasty RCE.

>>paradi+(OP)
Speaking of fintech

https://www.henricodolfing.ch/case-study-4-the-440-million-s...

>>theide+7p
Lest we forget, they initially rose to prominence by being cheaper than the existing solutions, not better, and I suppose this is a tradeoff a lot of their customers are willing to make.

>>markus+Ga
I think everyone favors moving fast. We humans want to see results of our action early.

>>vlovic+tr
> However, the company is so big that there’s so many different products all shipping at the same time it can be hard to correlate it to your release

This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective and it's starting to look like a liability for large numbers of people there are obvious solutions for that.

replies(3): >>vlovic+7H >>pulkit+9K >>evanel+kV

>>dehrma+tq
> Rollouts likely take much longer

Cloudflare’s own post says the configuration change that resulted in the outage rolled out in seconds.

>>autoex+xF
Can you name a major cloud provider that doesn’t have major outages?

If this were purely a money problem it would have been solved ages ago. It’s a difficult problem to solve. Also, they’re the youngest of the major cloud providers and have a fraction of the resources that Google, Amazon, and Microsoft have.

replies(1): >>autoex+jJ

>>notepa+xv
> Control-plane and data-plane should be separate

They are separate.

> a react patch shouldn't affect traffic forwarding.

If you can’t even bother to read the blog post maybe you shouldn’t be so confident in your own analysis of what should and shouldn’t have happened?

This was a configuration change to change the buffered size of a body from 256kb to 1mib.

The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.

replies(1): >>notepa+5U

>>vlovic+7H
> Can you name a major cloud provider that doesn’t have major outages?

That fact that no major cloud provider is actually good is not an argument that cloudflare isn't bad, or even that they couldn't/shouldn't do better than they are. They have fewer resources than Google or Microsoft but they're also in a unique position that makes us differently vulnerable when they fuck up. It's not all their fault, since it was a mistake to centralize the internet to the extent that we have in the first place, but now that they are responsible for so much they have to expect that people will be upset when they fail.

replies(1): >>vlovic+qh2

>>autoex+xF
Genuinely curious, how to actually implement detection systems for a large scale global infra which that works with < 1 minute SLO ? Given cost is no constraint.

replies(1): >>autoex+CP

>>pulkit+9K
Right now I'd say maybe don't push changes to your entire global infra all at once and certainty not without testing your change first to make sure it doesn't break anything, but it's really not about a specific failure/fix as much as it is about a single company getting too big to do the job well or just plain doing more than it should in the first place.

Honestly we shouldn't have created a system where any single company's failure is able to impact such a huge percentage of the network. The internet was designed for resilience and we abandoned that ideal to put our trust in a single company that maybe isn't up for the job. Maybe no one company ever could do it well enough, but I suspect that no single company should carry that responsibility in the first place.

replies(1): >>mewpme+Mp1

>>vlovic+yI
You really should take some of your pill.

> Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

> Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.

The body parsing logic is in react or nextjs, that's my takeaway, is it that incorrect? and the WAF rule testing tool (control plane) was interdependent with the WAF's body parsing logic, is that also incorrect?

> This was a configuration change to change the buffered size of a body from 256kb to 1mib.

Yes, and if it was resilient,the body parsing is done on a discrete forwarding plane. Any config changes should be auto-tested for forwarding failures by the separate control plane and auto-revered when there are errors. If the waf rule testing tool was part of that test then it being down shouldn't have affected data-plane because it would be a separate system.

data/control plane separate means the run time of the two and any dependencies they have are separate. It isn't cheap to do this right, that's why I speculated (I made clear i was speculating) that it was because they wanted to save costs.

> The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.

Please tone down the rage a bit and leave room for some discussion. You should take your own pill and be curious about what I meant instead of taking a rage-first approach.

replies(2): >>jadams+sf1 >>mewpme+Hk1

>>autoex+xF
What "hundreds of billions of dollars"? Cloudflare's annual revenue is around $2 billion, and they are not yet profitable.

replies(2): >>autoex+sj1 >>froobe+yt1

>>vlovic+tr
With all due respect, engineers in finance can’t allow for outages like this because then you are losing massive amounts of money and potentially going out of business.

>>notepa+5U
> The body parsing logic is in react or nextjs, that's my takeaway, is it that incorrect?

The exploit they were trying to protect against is in React services run by their customers.

replies(1): >>notepa+cC1

>>evanel+kV
That was admittedly hyperbole, but since we're talking about a company with assets and revenue in the billions I'm not sure it matters. The fact remains that a lack of money/resources is not their problem.

replies(1): >>evanel+Xm1

>>notepa+5U
To be clear:

1. There is an active vulnerability unrelated to Cloudflare where React/Next.JS can be abused via a malicious payload. The payload could be up to 1MB.

2. Cloudflare had buffer size that wasn't enough to prevent that payload from being sent to the Customer of the Cloudflare.

3. Cloudflare to protect their customers wanted to increase the buffer size to 1MB.

4. Internal Testing Tool wasn't able to handle change to 1MB and started failing.

5. They wanted to stop Internal Testing Tool from failing, but the Internal Testing Tool required disabling a ruleset which an existing system was depending on (due to a long existing bug). This caused the wider incident.

It does seem to be like a mess in the sense that in order to stop internal testing tool from failing they had to endanger things globally in production, yes. It looks like legacy, tech debt mess.

It seems like bad decisions done in the past though.

replies(1): >>phatfi+iH1

>>autoex+sj1
They don't have unlimited resources. They have ~5000 employees. That's not small but it's not huge either. For sake of comparison, Google hit that headcount level literally 20 years ago.

replies(1): >>autoex+tt1

>>autoex+CP
But then would a customer have to use 10 different vendors to get the same things that Cloudflare currently provides? E.g. protection against various threats online?

>>evanel+Xm1
They have enough money to buy anything they need. The CEO alone has billions. He could pay for as many employees as he wants out of his own pocket and not notice. In fact he's good at buying people, even senators.

replies(1): >>evanel+iz1

>>evanel+kV
Given how well-established cloudflare is, I would've figured they'd be profitable by now. That raises the question: why does so much of the web rely on a company which does not have the means to sustain itself?

replies(1): >>bdangu+Kt1

>>froobe+yt1
given how much of the population relies on Uber for their transportation… ;)

>>autoex+tt1
That doesn't make sense. It would be like saying Twitter, SpaceX, and Tesla all should be incapable of engineering mistakes because their owner is rich. The world doesn't work that way.

>>jadams+sf1
that makes better sense now, thanks. I feel dumb now that I re-read it, in my mind they patched nextjs/react and the new patch somehow required more buffer size.

>>mewpme+Hk1
Maybe they should have focused on fixing the "legacy tech debt mess", rather than pushing out more and more services and trying to be like AWS or Azure.

>>dehrma+tq
The blog post said the rollout of the config change took 1 minute.

>>vlovic+tr
The fintech company I worked at does handle millions of QPS has has thousands of servers. It is on the same order of magnitude or at least 0.1x scale, not to mention the complexity of business logic involving monetary transactions.

If there’s indeed a 5 min lag in monitoring dashboard in Cloudflare, I honestly think that's a pretty big concern.

For example, a simple curl script on your top 100 customers' homepage that runs every 30 seconds would have given the warning and notifications within a minute. If you stagger deployments at 5 minute intervals, you could have identified the issue and initiated the rollback within 2 minutes and completed it within 3 minutes.

>>autoex+jJ
Every major cloud provider (including Cloudflare) is orders of magnitude better at keeping 9s of availability worldwide for thousands of customers than those customers are individually. The very best of those customers might be better and only rely on cloud providers for the scaling or huge amounts of infrastructure they don’t otherwise want to own, but the vast majority are actually less capable at accomplishing whatever uptime the providers already get.

Could cloudflare do better? Sure, that’s a truism for everyone. Did they make mistakes and continue to make mistakes? Also a truism.

Trust me, they are acutely aware of people getting upset when they fail. Why do you think they’re CEO and CTO are writing these blog posts?