zlacker

[parent] [thread] 11 comments
1. SV_Bub+(OP)[view] [source] 2023-06-29 17:48:49
If it takes someone to manually change it from green to red, that does seem to defeat the purpose.
replies(6): >>evulho+61 >>klysm+e1 >>jabart+g1 >>distor+q1 >>numbsa+v2 >>AYBABT+t6
2. evulho+61[view] [source] 2023-06-29 17:52:27
>>SV_Bub+(OP)
Yep, and when money comes into play because you're supposed to meet SLAs, you certainly don't want it being automatic.
3. klysm+e1[view] [source] 2023-06-29 17:52:56
>>SV_Bub+(OP)
Possibly, but sometimes with failures this bad you can't get to the page to update it.
replies(1): >>munk-a+c2
4. jabart+g1[view] [source] 2023-06-29 17:52:59
>>SV_Bub+(OP)
No it doesn't. The number of false alarms you can get with internet-based monitoring is more than zero. You could have a BGP route break things for the one ISP your monitoring happens to use. You could have a failover event where it takes 30 seconds for everything to converge. I have multiple monitors on my app at 1-minute intervals from different vendors, and a user will ALWAYS email us within 5 seconds of an issue anyway. It's not realistic for a company to let automatic status updates fire without a person reviewing them, because too many things can go wrong with the automatic update and cause panic.
replies(2): >>lucb1e+X1 >>wongar+g3
5. distor+q1[view] [source] 2023-06-29 17:53:35
>>SV_Bub+(OP)
Unknown unknowns means you can have catastrophic system failures that automated alerts don't detect.
6. lucb1e+X1[view] [source] [discussion] 2023-06-29 17:54:55
>>jabart+g1
Who would panic? If nobody notices it's out because it's not, then nobody is going to be checking the status page. And if they do see the status page showing red while it's up, it's not like they're going to be unhappy about their SLA being met.

Maybe you want human confirmation on historic figures, but the live thing might as well be live.

7. munk-a+c2[view] [source] [discussion] 2023-06-29 17:55:30
>>klysm+e1
There was that hilarious multi-hour AWS failure a while back where the status page was updated via one of their internal services... and that service went down as part of the outage.
8. numbsa+v2[view] [source] 2023-06-29 17:56:26
>>SV_Bub+(OP)
I bet they could teach Copilot to create a PR to make the change, and build some GitHub Actions to automatically merge those changes.
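
(The auto-merge half is a single REST call against GitHub's API; a rough sketch in Python, where the repo, PR number, and token handling are all made up:)

    # Hypothetical sketch of the "auto-merge the status PR" step, using
    # GitHub's REST endpoint PUT /repos/{owner}/{repo}/pulls/{number}/merge.
    # Repo, PR number, and token handling are made up for illustration.
    import os
    import requests

    def merge_status_pr(owner, repo, pr_number):
        resp = requests.put(
            f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}/merge",
            headers={
                "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
                "Accept": "application/vnd.github+json",
            },
            json={"merge_method": "squash"},
        )
        resp.raise_for_status()
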
9. wongar+g3[view] [source] [discussion] 2023-06-29 17:59:21
>>jabart+g1
Most paid status monitoring services cover BGP route problems and ISP issues by only flagging an event if it's detected from X geographically diverse endpoints.

For the 30 seconds where you wait for failover to complete: that is a 30-second outage. It's not necessarily profitable to admit to it, but showing it as a 30-second outage would be accurate.
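
(A rough sketch of that N-of-M quorum idea in Python; the region names and the 3-of-5 threshold are made up:)

    # Hypothetical sketch: only declare an outage if a quorum of
    # geographically diverse probes agrees, which filters out single-ISP
    # or single-route false alarms.
    REQUIRED_FAILURES = 3  # made-up threshold, e.g. 3 of 5 regions

    def is_outage(probe_results):
        """probe_results: dict of region -> bool (True = check failed)."""
        failures = sum(1 for failed in probe_results.values() if failed)
        return failures >= REQUIRED_FAILURES

    # A BGP issue affecting one ISP typically trips only one region:
    is_outage({"us-east": True, "eu-west": False, "ap-south": False,
               "us-west": False, "sa-east": False})   # -> False
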

replies(2): >>jabart+Mc >>jabart+cd
10. AYBABT+t6[view] [source] 2023-06-29 18:09:33
>>SV_Bub+(OP)
Not really; things fail in unexpected ways. Automated anomaly detection is notoriously error-prone, leading to a lot of false positives and false negatives even in the trivial case of monitoring a single timeseries. For a system the size of GitHub, you need to monitor a whole host of things, and if it's nearly impossible to do one timeseries well, there's basically no hope of doing automated anomaly detection across many timeseries with a signal-to-noise ratio better than "humans looking at the thing and realizing it's not going well".

There's stuff like this that can't be automated well. The automated result is far worse than the human-based alternative.
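
(Even for the single-timeseries case, a toy rolling z-score detector shows the problem: the one threshold you get to tune is exactly the false-positive/false-negative tradeoff. A sketch with made-up numbers, not anyone's production detector:)

    # Toy rolling z-score detector for one timeseries. The single
    # THRESHOLD knob trades false positives against false negatives;
    # on noisy real-world data no value avoids both.
    from statistics import mean, stdev

    WINDOW = 60        # made-up: look at the last 60 samples
    THRESHOLD = 3.0    # made-up: flag beyond 3 standard deviations

    def is_anomalous(history, value):
        if len(history) < WINDOW:
            return False                      # not enough data yet
        window = history[-WINDOW:]
        sigma = stdev(window)
        if sigma == 0:
            return value != window[-1]        # flat line: any change flags
        return abs(value - mean(window)) / sigma > THRESHOLD
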

11. jabart+Mc[view] [source] [discussion] 2023-06-29 18:36:54
>>wongar+g3
The default TCP timeout is more than 30 seconds. The internet itself has about 99.9% uptime. If one company showed every 30-second blip on their outage page, all their competitors would have that screenshot on the first page of their pitch deck, even if they also had the same issue. 2-5 minutes is a reasonable threshold for a public service to announce an outage.
12. jabart+cd[view] [source] [discussion] 2023-06-29 18:38:58
>>wongar+g3
Forgot about that CenturyLink BGP infinite-loop route bug they had that took down their whole system nationwide. A lot of monitoring services showed red even though it was only one ISP that was down.