zlacker

[parent] [thread] 28 comments
1. stefan+(OP)[view] [source] 2023-06-29 17:46:51
So down right now... I wonder why they still use https://www.githubstatus.com/, which reports that everything is alright when it's not!
replies(4): >>SV_Bub+u >>Shekel+k1 >>munk-a+e2 >>maschu+8f4
2. SV_Bub+u[view] [source] 2023-06-29 17:48:49
>>stefan+(OP)
If it takes someone to manually change it from green to red, that does seem to defeat the purpose.
replies(6): >>evulho+A1 >>klysm+I1 >>jabart+K1 >>distor+U1 >>numbsa+Z2 >>AYBABT+X6
3. Shekel+k1[view] [source] 2023-06-29 17:51:34
>>stefan+(OP)
Pretty much every company has been shown to have fake status pages at this point.
replies(5): >>ezekg+E3 >>klysm+R4 >>wsatb+65 >>Night_+95 >>AYBABT+W5
4. evulho+A1[view] [source] [discussion] 2023-06-29 17:52:27
>>SV_Bub+u
Yep, and when money comes into play when you're supposed to meet SLAs, you certainly don't want it being automatic.
5. klysm+I1[view] [source] [discussion] 2023-06-29 17:52:56
>>SV_Bub+u
Possibly, but sometimes with failures this bad you can't get to the page to update it.
replies(1): >>munk-a+G2
6. jabart+K1[view] [source] [discussion] 2023-06-29 17:52:59
>>SV_Bub+u
No it doesn't. The number of false alarms you can get with internet-based monitoring is well above zero. You could have a BGP route break things for the one ISP your monitoring happens to use. You could have a failover event where it takes 30 seconds for everything to converge. I have multiple monitors on my app at 1-minute intervals from different vendors, and a user will ALWAYS email us within 5 seconds of an issue. It's not realistic for a company to have automatic status updates trigger without a person manually reviewing them, because too many things can go wrong with an automatic status update and cause panic.
replies(2): >>lucb1e+r2 >>wongar+K3
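The kind of gating jabart describes (multiple independent vendors, a convergence window before anything goes public) can be sketched in a few lines. Everything here is illustrative, assuming nothing about any vendor's actual API; the class name and thresholds are made up:

```python
from collections import deque

class StatusGate:
    """Flip the public status to 'down' only after `window` consecutive
    probe rounds in which at least `quorum` independent monitors agree
    there is a failure, so single-vendor blips and short failover
    events never reach the status page."""

    def __init__(self, quorum=2, window=3):
        self.quorum = quorum              # monitors that must agree per round
        self.recent = deque(maxlen=window)

    def observe(self, round_results):
        """round_results maps monitor name -> True if its probe failed."""
        failures = sum(1 for failed in round_results.values() if failed)
        self.recent.append(failures >= self.quorum)
        full = len(self.recent) == self.recent.maxlen
        return "down" if full and all(self.recent) else "up"
```

With `window=3` and one-minute probe intervals, a 30-second failover never flips the page, at the cost of announcing a real outage a few minutes late, which is the trade-off the comment is arguing about.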
7. distor+U1[view] [source] [discussion] 2023-06-29 17:53:35
>>SV_Bub+u
Unknown unknowns means you can have catastrophic system failures that automated alerts don't detect.
8. munk-a+e2[view] [source] 2023-06-29 17:54:20
>>stefan+(OP)
https://downdetector.com/status/github/ is a far more reliable source - it's just powered by user reports and often will show issues long before the status page ever receives an update.
replies(1): >>jachee+OB
9. lucb1e+r2[view] [source] [discussion] 2023-06-29 17:54:55
>>jabart+K1
Who would panic? If nobody notices it's out because it's not, then nobody is going to be checking the status page. And if they do see the status page showing red while it's up, it's not like they're going to be unhappy about their SLA being met.

Maybe you want human confirmation on the historical figures, but the live thing might as well be live.

10. munk-a+G2[view] [source] [discussion] 2023-06-29 17:55:30
>>klysm+I1
There was that hilarious multi-hour AWS failure a while back where the status page was updated via one of their internal services... and that service went down as part of the outage.
11. numbsa+Z2[view] [source] [discussion] 2023-06-29 17:56:26
>>SV_Bub+u
I bet they could teach Copilot to create a PR to make the change, and build some GitHub Actions to automatically merge those changes.
12. ezekg+E3[view] [source] [discussion] 2023-06-29 17:58:53
>>Shekel+k1
Pretty much. They want the burden of proof for SLAs to fall on the customer, not on themselves. If a customer has to prove that an outage specifically affected them, they are much less likely to have a successful case against the failure to meet their SLA.

(Not directed at GitHub specifically, but at bogus status pages.)

13. wongar+K3[view] [source] [discussion] 2023-06-29 17:59:21
>>jabart+K1
Most paid status monitoring services cover BGP route problems and ISP issues by only flagging an event if it's detected from X geographically diverse endpoints.

For the 30 seconds where you wait for failover to complete: that is a 30-second outage. It's not necessarily profitable to admit to it, but showing it as a 30-second outage would be accurate.

replies(2): >>jabart+gd >>jabart+Gd
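The geographic-diversity rule wongar describes is simple to sketch; the report shape and the `min_regions` threshold below are made up for illustration, not taken from any monitoring product:

```python
def confirmed_outage(reports, min_regions=3):
    """reports: list of (probe_id, region, failed) tuples from one
    monitoring round. Treat the event as a real outage only if the
    failing probes span at least `min_regions` distinct geographic
    regions, so a single-ISP or single-region routing problem
    doesn't trip the alert on its own."""
    failing_regions = {region for _probe, region, failed in reports if failed}
    return len(failing_regions) >= min_regions
```

A BGP problem at one ISP shows up as many failures in one region and never clears the bar, while a genuine origin outage fails everywhere at once.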
14. klysm+R4[view] [source] [discussion] 2023-06-29 18:03:12
>>Shekel+k1
fake and not automated are pretty different
15. wsatb+65[view] [source] [discussion] 2023-06-29 18:03:53
>>Shekel+k1
From my experience, GitHub is the best out there when it comes to updating their status page.
16. Night_+95[view] [source] [discussion] 2023-06-29 18:03:56
>>Shekel+k1
Really? Why?

That's so disappointing.

replies(1): >>cududa+w8
17. AYBABT+W5[view] [source] [discussion] 2023-06-29 18:06:16
>>Shekel+k1
Status pages are updated by humans and the humans need to (1) realize there's a problem and (2) understand the magnitude of the problem and (3) put that on the status page.

It's not fake, it's just a human process. And automating this would be error prone just the same.

replies(3): >>Macuyi+V6 >>wsatb+t9 >>jachee+2C
18. Macuyi+V6[view] [source] [discussion] 2023-06-29 18:09:22
>>AYBABT+W5
Very good points. Meanwhile I have clients asking me why they can't have a status page, to which I reply: you can, but ultimately, to be completely fail-proof, it will be a human updating it slowly. To which they reply: but GitHub or X does it...

Very infuriating, that.

replies(1): >>AYBABT+E8
19. AYBABT+X6[view] [source] [discussion] 2023-06-29 18:09:33
>>SV_Bub+u
Not really, things fail in unexpected ways. Automated anomaly detection is notoriously error prone, leading to a lot of false positives and false negatives, even in the trivial case of monitoring a single timeseries. For a system the size of GitHub, you need to monitor a whole host of things, and if it's quasi-impossible to do one timeseries well, there's basically no hope of doing automated anomaly detection across many timeseries with a signal-to-noise ratio that's better than "humans looking at the thing and realizing it's not going well".

There's stuff like this that can't be automated well. The automated result is far worse than the human-based alternative.
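The single-timeseries case is easy to sketch, and even this toy rolling z-score detector shows the tuning problem being described; the window and threshold values here are arbitrary assumptions:

```python
import statistics

def zscore_alerts(series, window=10, threshold=3.0):
    """Flag indices whose value deviates from the trailing window's
    mean by more than `threshold` standard deviations. Loosen the
    threshold and real incidents slip through; tighten it and bursty
    but healthy traffic pages someone at 3am."""
    alerts = []
    for i in range(window, len(series)):
        win = series[i - window:i]
        mean = statistics.fmean(win)
        stdev = statistics.pstdev(win)
        if stdev and abs(series[i] - mean) > threshold * stdev:
            alerts.append(i)
    return alerts
```

And this is one metric with one detector; the comment's point is that a status page would need hundreds of these, each with its own false-positive rate.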

20. cududa+w8[view] [source] [discussion] 2023-06-29 18:15:07
>>Night_+95
Two technical reasons, capstoned by the driving business motivation:

- False positives
- Short outages that last a minute or three

Ultimately, SLAs and uptime guarantees. That way, a business can't automatically tally every minute of publicly admitted downtime against the 99.99999% uptime guarantee, and the onus to prove a breach of contract is on the customer.
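For perspective, the downtime budget behind an uptime guarantee is simple arithmetic; a quick sketch (assuming a 365-day year):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 (ignoring leap years)

def downtime_budget_seconds(uptime_pct):
    """Seconds of downtime per year that an uptime guarantee allows."""
    return SECONDS_PER_YEAR * (1 - uptime_pct / 100)

# Three nines allows roughly 8.76 hours of downtime per year,
# while seven nines allows only about 3.2 seconds -- so a single
# "minute or three" outage blows the whole annual budget.
for pct in (99.9, 99.99, 99.999, 99.99999):
    print(f"{pct}% uptime -> {downtime_budget_seconds(pct):.2f} s/year of downtime")
```

At seven nines, even one 30-second blip exceeds the yearly allowance roughly tenfold, which is exactly why nobody wants it tallied automatically.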

21. AYBABT+E8[view] [source] [discussion] 2023-06-29 18:15:47
>>Macuyi+V6
There's some nice tooling these days for this. E.g. https://firehydrant.com/ and https://incident.io both make this a faster, more embedded process.
replies(2): >>sjwhit+mf >>amanda+qR
22. wsatb+t9[view] [source] [discussion] 2023-06-29 18:19:37
>>AYBABT+W5
I wouldn't necessarily call them fake, but the issue often has to be big enough for most companies to admit to it. AWS often has smaller outages that they will never acknowledge.
23. jabart+gd[view] [source] [discussion] 2023-06-29 18:36:54
>>wongar+K3
The TCP default timeout is more than 30 seconds. The internet itself has about 99.9% uptime. If one company showed every 30-second blip on their outage page, all their competitors would have that screenshot on the first page of their pitch deck, even if they also had the same issue. 2-5 minutes is reasonable for a public service to announce an outage.
24. jabart+Gd[view] [source] [discussion] 2023-06-29 18:38:58
>>wongar+K3
Forgot about that CenturyLink BGP infinite-loop route bug they had, where it took down their whole system nationwide. A lot of monitoring services showed red even though it was just one ISP that was down.
25. sjwhit+mf[view] [source] [discussion] 2023-06-29 18:46:32
>>AYBABT+E8
Hey, incident.io CEO here! Thanks for mentioning us.
26. jachee+OB[view] [source] [discussion] 2023-06-29 20:34:08
>>munk-a+e2
Keep in mind that downdetector can be brigaded and/or show knock-on problems instead of root causes. e.g. A couple weeks ago there were fairly major spikes across a rather huge variety of services on there, but it turned out that it was actually Comcast that was having trouble, rather than any of the “down” services.
27. jachee+2C[view] [source] [discussion] 2023-06-29 20:35:09
>>AYBABT+W5
Also (2b) convince their boss that the “optics” are better to update sooner than later.
28. amanda+qR[view] [source] [discussion] 2023-06-29 21:54:08
>>AYBABT+E8
And Jeli.io for this! With the Statuspage integration, you can set the status, impact, write a message for customers, and select impacted components all without leaving Slack. Statuspage gets updated with a click of a button.
29. maschu+8f4[view] [source] 2023-06-30 18:53:24
>>stefan+(OP)
Maybe the status page is down - we need a status page to tell us whether the status page is down.