zlacker

> Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules

Warning signs like this are how you know that something might be wrong!

replies(5): >>philip+m6 >>testpl+5x >>bombca+b51 >>shadow+Hd1 >>8note+wv1

>>Scaevo+(OP)
> Warning signs like this are how you know that something might be wrong!

Yes. As they explain, it was the rollback triggered by seeing these errors that broke stuff.

replies(2): >>Scaevo+Cg >>8cvor6+Ng

>>philip+m6
They saw errors and decided to do a second rollout to disable the component generating errors, causing a major outage.

replies(1): >>JesseJ+fI1

>>philip+m6
It would be nice if the outage dashboards were directly linked to this instead of whatever they have now.

>>Scaevo+(OP)
> They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

This is what jumped out at me as the biggest problem. A wild west deployment process is a valid (but questionable) business decision, but if you do that then you need smart people in place to troubleshoot and make quick rollback decisions.

Their timeline:

> 08:47: Configuration change deployed and propagated to the network

> 08:48: Change fully propagated

> 08:50: Automated alerts

> 09:11: Configuration change reverted and propagation start

> 09:12: Revert fully propagated, all traffic restored

2 minutes for their automated alerts to fire is terrible. For a system that is expected to have no downtime, they should have been alerted to the spike in 500 errors within seconds, before the change had even fully propagated. Ideally the rollback would have been automated, but even if it is manual, the dude pressing the deploy button should have had realtime metrics on a second display with his finger hovering over the rollback button.
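A rough sketch of the "correlate the change to the errors" step, in Python. Everything here is hypothetical (the `ConfigChange` type, the change IDs, the 5-minute lookback); it only shows the idea of attributing an error spike to the most recent config change that preceded it, using the 08:47 deploy and 08:50 alert from the timeline as sample input.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConfigChange:
    change_id: str      # hypothetical identifier, not a real Cloudflare ID
    deployed_at: float  # seconds since midnight, for this example


def suspect_change(
    changes: List[ConfigChange], spike_at: float, lookback_s: float = 300.0
) -> Optional[ConfigChange]:
    """Return the most recent change deployed within lookback_s before the spike."""
    candidates = [c for c in changes if 0 <= spike_at - c.deployed_at <= lookback_s]
    return max(candidates, key=lambda c: c.deployed_at, default=None)


if __name__ == "__main__":
    changes = [
        ConfigChange("earlier-unrelated-change", 7 * 3600),      # 07:00
        ConfigChange("disable-waf-test-rule", 8 * 3600 + 47 * 60),  # 08:47
    ]
    alert_time = 8 * 3600 + 50 * 60  # 08:50
    suspect = suspect_change(changes, alert_time)
    print("rollback candidate:", suspect.change_id if suspect else "none")
```

With only one change in the lookback window the answer is trivial, which is sort of the point: the hard part is having the change log and the error metrics in the same place, not the lookup.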

Ok, so they want to take the approach of roll forward instead of immediate rollback. Again, that's a valid approach, but you need to be prepared. At 08:48, they would have had tens of millions of "init.lua:314: attempt to index field 'execute'" messages being logged per second. Exact line of code. Not a complex issue. They should have had engineers reading that code and piecing this together by 08:49. The change you just deployed was to disable an "execute" rule. Put two and two together. Initiate rollback by 08:50.

How disconnected are the teams that do deployments vs the teams that understand the code? How many minutes were they scratching their butts wondering "what is init.lua"? Are they deploying while their best engineers are sleeping?

replies(2): >>morphe+0Y >>bostik+ni1

>>testpl+5x
I see lots of people complaining about this down time but in actuality is it really that big a deal to have 30 minutes of down time or whatever. It's not like anything behind cloudflare is "mission critical" in the sense that lives are at stake or even a huge amount of money is at stake. In many developed countries, the electric power service has occasional local outages, and that matters more than not being able to load a website. I agree that if CF is offering a certain standard of reliability and not meeting it, they should offer prorated refunds for the unexpected down time, but otherwise I am not seeing what the big deal is here.

replies(5): >>therei+dZ >>bombca+X51 >>ljm+Zc1 >>odie55+jd1 >>morito+sy1

>>morphe+0Y
> about this down time but in actuality is it really that big a deal to have 30 minutes of down time or whatever. It's not like anything behind cloudflare is "mission critical" in the sense that lives are at stake or even a huge amount of money is at stake.

This reads like sarcasm. But I guess it is not. Yes, you are a CDN, a major one at that. 30 minutes of downtime or "whatever" is not acceptable. I worked on traffic teams at social networks that saw themselves as that mission critical. CF is absolutely that critical, and lives are definitely at stake.

>>Scaevo+(OP)
"Uh... it's probably not a problem... probably... but I'm showing a small discrepancy in... well, no, it's well within acceptable bounds again. Sustaining sequence. Nothing you need to worry about, Gordon. Go ahead."

>>morphe+0Y
30 minutes of downtime is fine for most things, including Amazon.

30 minutes of unplanned downtime for infrastructure is unacceptable; but we’re tending to accept it. AWS or Cloudflare have positioned themselves as The Internet so they need to be held to a higher standard.

>>morphe+0Y
> It's not like anything behind cloudflare is "mission critical" in the sense that lives are at stake or even a huge amount of money is at stake.

This is far too dismissive of how disruptive the downtime can be and it sets the bar way too low for a company so deeply entangled in global internet infrastructure.

I don’t think you can make such an assertion with any degree of credibility.

>>morphe+0Y
> It's not like anything behind cloudflare is "mission critical" in the sense that lives are at stake or even a huge amount of money is at stake.

Yes, there are lots of mission critical systems that use cloudflare and lives and huge amounts of money are at stake.

replies(1): >>morphe+2G1

>>Scaevo+(OP)
"Hey, this change is making the 'check engine' light turn on all the time. No problem; I just grabbed some pliers and crushed the bulb."

>>testpl+5x
> 2 minutes for their automated alerts to fire is terrible

I take exception to that, to be honest. It's not desirable or ideal, but calling it "terrible" is a bit ... well, sorry to use the word ... entitled. For context, I have experience running a betting exchange. A system where it's common for a notable fraction of transactions in a medium-volume event to take place within a window of less than 30 seconds.

The vast majority of current monitoring systems are built on Prometheus. (Well okay, these days it's more likely something Prom-compatible but more reliable.) That implies collection via recurring scrapes. A supposedly "high" frequency online service monitoring system does a scrape every 30 seconds. Well-known reliability engineering practices state that you need a minimum of two consecutive telemetry points to detect any given event, because we're talking about a distributed system and the network is not a reliable transport. That in turn means that even with near-perfect reliability, the maximum time window before you can detect something failing is the time it takes to perform three scrapes: thing A might have failed a second after the last scrape, so two consecutive failures will show up only after a delay of just-a-hair-shy-of-three scraping cycle windows.

At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.

As for my history? The betting exchange monitoring was tuned to run scrapes at 10-second intervals. That still meant that the earliest an alert could fire for something failing was effectively 30 seconds after the failures manifested.

Two minutes for something that does not run primarily financial transactions is a pretty decent alerting window.
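To put numbers on that, here is a small Python sketch of the bound as described above. The formula is an assumption taken from this comment's own reasoning (one extra interval for a failure that starts just after a scrape, plus one interval per required consecutive point), not something read out of Prometheus or any other product, and the function name is made up.

```python
def worst_case_detection_delay(scrape_interval_s: float, consecutive_points: int) -> float:
    """Upper bound, in seconds, between a failure starting and an alert firing.

    Assumption (from the comment above, not from any monitoring product):
    a failure that begins just after a scrape costs one extra interval on
    top of the `consecutive_points` intervals, so the bound is
    (consecutive_points + 1) scrape intervals.
    """
    if scrape_interval_s <= 0 or consecutive_points < 1:
        raise ValueError("interval must be positive and at least one point is required")
    return (consecutive_points + 1) * scrape_interval_s


if __name__ == "__main__":
    # 30s scrapes with two points, 10s scrapes with two points,
    # and 30s scrapes with three points.
    for interval, points in [(30, 2), (10, 2), (30, 3)]:
        delay = worst_case_detection_delay(interval, points)
        print(f"{interval}s scrapes, {points} consecutive points: up to {delay:.0f}s")
```

Under this model, 30-second scrapes with three required points land at 120 seconds, which is the same order as the two minutes in the timeline. That is only an observation about the arithmetic; nothing in the post says how their alerting is actually configured.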

replies(3): >>parchl+1l1 >>dotanc+Up1 >>yearol+Wx1

>>bostik+ni1
> At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.

Sorry, but that's a method you use if you serve 100 requests per second, not when you are at Cloudflare's scale. Cloudflare easily has enough volume that this problem would trigger an instant change in a monitorable failure rate.
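A toy Python sketch of why volume matters here. All names and numbers are invented for illustration (the 5% threshold, the 10,000-request minimum, the traffic figures); it is not how any real alerting pipeline is built. The idea: an error ratio is only trustworthy once enough requests are in the window, and a high-volume service reaches that count in a window of a couple of seconds, while a 100 rps service needs a window of minutes.

```python
from collections import deque
from typing import Optional


class ErrorRateWindow:
    """Sliding window over per-second (requests, errors) buckets."""

    def __init__(self, window_s: int, threshold: float, min_requests: int):
        self.buckets = deque(maxlen=window_s)  # oldest bucket drops off automatically
        self.threshold = threshold
        self.min_requests = min_requests

    def add_second(self, requests: int, errors: int) -> bool:
        """Record one second of traffic; return True if the alert should fire."""
        self.buckets.append((requests, errors))
        total = sum(r for r, _ in self.buckets)
        failed = sum(e for _, e in self.buckets)
        if total < self.min_requests:
            return False  # not enough traffic to trust the ratio yet
        return failed / total > self.threshold


def seconds_until_alert(rps: int, bad_fraction: float, window_s: int,
                        threshold: float = 0.05, min_requests: int = 10_000,
                        limit: int = 600) -> Optional[int]:
    """Seconds of failing traffic before the alert fires, or None if it never does."""
    detector = ErrorRateWindow(window_s, threshold, min_requests)
    for _ in range(window_s):            # warm up with healthy traffic
        detector.add_second(rps, 0)
    for second in range(1, limit + 1):   # then the failure starts
        if detector.add_second(rps, int(rps * bad_fraction)):
            return second
    return None


if __name__ == "__main__":
    print("1M rps, 2s window:", seconds_until_alert(1_000_000, 0.5, window_s=2))
    print("100 rps, 2s window:", seconds_until_alert(100, 0.5, window_s=2))
    print("100 rps, 120s window:", seconds_until_alert(100, 0.5, window_s=120))
```

This ignores collection and aggregation latency entirely, which is a separate source of delay.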

replies(1): >>rossju+DM1

>>bostik+ni1
Prometheus compatible but more reliable? Sell it to me!

replies(1): >>bostik+Gl2

>>Scaevo+(OP)
They aren't a panacea though. Internal tools like that can be super noisy on errors, and can be broken more often than they're working.

>>bostik+ni1
Critical high-level stats such as errors should be scraped more frequently than every 30 seconds. It's important to have scraping intervals at multiple time granularities; a small set of the most critical stats should be scraped at closer to 10s or 15s.

Prometheus has an unaddressed flaw [0], where the range used by rate functions must be at least 2x the scrape interval. This means that if you scrape at 30s intervals, your rate charts won't reflect the change until a minute after.

[0] - https://github.com/prometheus/prometheus/issues/3746
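A simplified Python illustration of why the range has to cover at least two scrapes. This is not Prometheus code: the real rate() also extrapolates to the window edges and handles counter resets, and `simple_rate` is a made-up name. The only point shown is that a slope needs two samples, so a 30s range over 30s scrapes often has nothing to compute.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: Sequence[Sample], now: float, window_s: float) -> Optional[float]:
    """Per-second increase over (now - window_s, now], or None if it cannot be computed.

    Assumes `samples` is sorted by timestamp and the counter never resets.
    """
    in_window = [(t, v) for t, v in samples if now - window_s < t <= now]
    if len(in_window) < 2:
        return None  # a single sample gives no slope
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


if __name__ == "__main__":
    # Error counter scraped every 30s; errors start between t=30 and t=60.
    scrapes = [(0.0, 0.0), (30.0, 0.0), (60.0, 300.0)]
    print("30s range:", simple_rate(scrapes, now=75.0, window_s=30.0))
    print("60s range:", simple_rate(scrapes, now=75.0, window_s=60.0))
```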

replies(1): >>rossju+PN1

>>morphe+0Y
I am confident there are at least a few hospitals, GP offices or ticketing systems that interact directly or indirectly with Cloudflare. They've sold themselves as a major defence in security.

>>odie55+jd1
Can you provide an example of lives being at stake because of a Cloudflare outage?

replies(1): >>kortil+nl3

>>Scaevo+Cg
That was their first mistake. If your deployment does not behave the way you expect it to (or even gives you a bad smell), roll back. That's how it used to be... when I was a kid... lol. Or, I don't know, maybe load test before you deploy?

>>parchl+1l1
Let's say you have 10 million servers. No matter what you are deploying, some of those servers are going to be in a bad state. Some are going to be in hw ops/repair. Some are going to be mid-deployment for something else. A regional issue may be happening. You may have inconsistencies in network performance. Bad workloads somewhere/anywhere can be causing a constant level of error traffic.

At scale there's no such thing as "instant". There is distribution of progress over time.

The failure is an event. Collection of events takes time (at scale, going through store and forward layers). Your "monitorable failure rate" is over an interval. You must measure for that interval. And then you are going to emit another event.

Global config systems are a tradeoff. They're not inherently bad; they have both strengths and weaknesses. Really bad: non-zero possibility of system collapse. Bad: can progress very quickly toward global outages. Good: faults are detected quickly, response decision making is easy, and mitigation is fast.

Hyperscale is not just "a very large number of small simple systems".

Denoising alerts is a fact of life for SRE...and is a survival skill.

>>yearol+Wx1
"Scrape" intervals (and the plumbing through to analysis intervals) are chosen precisely because of the denoising function aggregation provides.

Most scaled analysis systems provide precise control over the type of aggregation used within the analyzed time slices. There are many possibilities, and different purposes for each.

High frequency events are often collected into distributions and the individual timestamps are thrown away.
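As a minimal Python sketch of what "collected into distributions" can look like: individual observations are dropped and only per-bucket counts survive. The bucket bounds and the function name are arbitrary choices for the example. Note that these are per-bucket counts; Prometheus-style histograms expose cumulative counts instead.

```python
from bisect import bisect_left
from typing import List, Sequence


def bucket_counts(values: Sequence[float], upper_bounds: Sequence[float]) -> List[int]:
    """Count values per bucket; the last slot is overflow (above the largest bound).

    `upper_bounds` must be sorted ascending. A value equal to a bound falls
    into that bound's bucket.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        counts[bisect_left(upper_bounds, v)] += 1
    return counts


if __name__ == "__main__":
    latencies_ms = [5, 10, 11, 75, 100, 250]  # made-up observations
    print(bucket_counts(latencies_ms, [10, 50, 100]))
```

Once the counts are emitted, the timestamps of the individual events are gone; only the slice they were aggregated into remains.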

>>dotanc+Up1
VictoriaMetrics. The answer to the question "could I get Prometheus, but with ClickHouse architecture?"

replies(1): >>dotanc+uy2

>>bostik+Gl2
I'll be looking into that, thank you.

While we're here, any other Prometheus or Grafana advice is welcome.

>>morphe+2G1
https://creators.spotify.com/pod/profile/epicompliance/episo...