zlacker

I noticed this outage last night (Cloudflare 500s on a few unrelated websites). As usual, when I went to Cloudflare's status page, nothing about the outage was present; the only thing there was a notice about the pre-planned maintenance work they were doing for the security issue, reporting that everything was being routed around it successfully.

replies(3): >>cnnliv+K8 >>mrb+vq >>matteo+KI1

>>jacobg+(OP)
This is the case with just about every status page I’ve ever seen. It takes them a while to realize there’s really a problem and then to update the page. One day these things will be automated, but until then, I wouldn’t expect more of Cloudflare than any other provider.

What’s more concerning to me is that now we’ve had AWS, Azure, and Cloudflare (and Cloudflare twice) go down recently. My gut says:

1. developers and IT are using LLMs in some part of the process, which will not be 100% reliable.

2. The current culture of "I have (some personal activity or problem)" or "we don’t have staff, AI will replace me, f-this."

3. Pandemic after effects.

4. Political climate / war / drugs; all are intermingled.

replies(4): >>mikkup+ha >>Techni+nf >>colech+Ql >>sugerm+Gy

>>cnnliv+K8
Management doesn't like when things like this are automated. They want to "manage" the outage/production/etc numbers before letting them out.

replies(2): >>Yeri+Ga >>kbolin+Kk

>>mikkup+ha
100% — will never be automated :)

replies(1): >>hnuser+fj

>>cnnliv+K8
Thing is, these things are automated... Internally.

Which makes it feel that much more special when a service provides open access to all of the infrastructure diagnostics, e.g. https://status.ppy.sh/

replies(1): >>rezona+1i

>>Techni+nf
Nice! Didn't know you could make a Datadog dashboard public like that!

>>Yeri+Ga
Still room for someone to claim the niche of the Porsche horsepower method in outage reporting - underpromise, overdeliver.

>>mikkup+ha
There's no sweet spot I've found. I don't work for Cloudflare but when I did have a status indicator to maintain, you could never please everyone. Users would complain when our system was up but a dependent system was down, saying that our status indicator was a lie. "Fixing" that by marking our system as down or degraded whenever a dependent system was down led to the status indicator being not green regularly, causing us to unfairly develop a reputation as unreliable (most broken dependencies had limited blast radius). The juice no longer seemed worth the squeeze and we gave up on automated status indicators.

replies(3): >>jacobg+Ts >>naniwa+cw >>noname+n42

>>cnnliv+K8
>It takes them a while to realize there’s really a problem and then to update the page.

Not really, they're just lying. I mean yes, of course they aren't oracles who discover complex problems in the instant of the first failure, but naw, they know well when there are problems and significantly underreport them, to the extent that they are less "smoke alarms" and more "your house has burned down and the ashes are still smoldering" alarms. Incidents are intentionally underreported. It's bad enough that there ought to be legislation and civil penalties for the large providers who fail to report known issues promptly.

>>jacobg+(OP)
The only way to change that is to shame them for it: "Cloudflare is so incompetent at detecting and managing outages that even their simple status page is unable to be accurate"

If enough high-ranked customers report this feedback...

>>kbolin+Kk
> "Fixing" that by marking our system as down or degraded whenever a dependent system was down led to the status indicator being not green regularly, causing us to unfairly develop a reputation as unreliable (most broken dependencies had limited blast radius).

This seems like an issue with the design of your status page. If the broken dependencies truly had a limited blast radius, you should have been able to communicate that in your indicators and statistics. If not, then the unreliable reputation was deserved, and all you did by removing the status page was hide it.

replies(1): >>Aeolun+QI

>>kbolin+Kk
The headline status doesn't have to be "worst of all systems". Pick a key indicator, and as long as it doesn't look like it's all green regardless of whether you're up or down, users will imagine that "green headline, red subsystems" means whatever they're observing, even if that makes the status display utterly uninterpretable from an outside perspective.
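
For what it's worth, here is a rough sketch of a headline that is not "worst of all systems". Everything in it is made up for illustration: the `Component` type, the `blast_radius` field (an assumed fraction of users affected when that component is down), and the 25% threshold. It only shows the idea of following one key indicator and letting dependencies degrade the headline when enough users are plausibly affected.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    GREEN = 0
    DEGRADED = 1
    DOWN = 2


@dataclass
class Component:
    name: str
    status: Status
    # Hypothetical field: assumed fraction of users affected (0.0 to 1.0)
    # when this component is not green.
    blast_radius: float


def headline_status(
    key: Component,
    dependencies: list[Component],
    degraded_threshold: float = 0.25,  # illustrative value, not a recommendation
) -> Status:
    """Follow the key indicator; dependencies only degrade the headline
    when the combined blast radius of the broken ones reaches the threshold."""
    if key.status != Status.GREEN:
        return key.status
    affected = sum(
        d.blast_radius for d in dependencies if d.status != Status.GREEN
    )
    return Status.DEGRADED if affected >= degraded_threshold else Status.GREEN
```

With this shape, a broken dependency that touches 5% of users leaves the headline green while its own row goes red, and the headline only turns yellow once the broken dependencies together cross the threshold.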

>>cnnliv+K8
Those are complex and tenuous explanations for events that have occurred since long before all of your reasons came into existence.

>>jacobg+Ts
> all you did by removing the status page was hide it

True, but everyone that actually made the company work was much happier for it.

>>jacobg+(OP)
The status page was updated 6 minutes after the first internal alert was triggered (8:50 -> 8:56:26 UTC); I wouldn't say this is too long.

>>kbolin+Kk
> whenever a dependent system was down led to the status indicator being not green regularly, causing us to unfairly develop a reputation as unreliable (most broken dependencies had limited blast radius)

You are responsible for your dependencies, unless they are specific integrations. Either switch to more reliable dependencies or add redundancy so that you can switch between providers when one is down.
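
A minimal sketch of what "switch between providers" could look like, assuming each provider is wrapped in a zero-argument callable that raises on failure. The name `call_with_failover` is hypothetical, and a real setup would likely also need timeouts, health checks, and some care around non-idempotent calls.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result.

    Raises RuntimeError if every provider fails.
    """
    errors: list[Exception] = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors!r}")
```

The order of `providers` is the preference order, so the primary goes first and the fallback is only called when the primary raises.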