zlacker

[parent] [thread] 15 comments
1. troupo+(OP)[view] [source] 2023-06-29 19:47:06
You'd be surprised how often those pages are updated manually. By the person on call who has other things to take care of first.
replies(1): >>Myster+Yc
2. Myster+Yc[view] [source] 2023-06-29 20:53:20
>>troupo+(OP)
Because a healthcheck ping every X seconds is too difficult to implement for a GitHub-sized company? Well, there you have it: a useless status page...
replies(7): >>virapt+lg >>naikro+Jg >>sjsdai+tk >>nijave+0C >>terom+lE >>camden+Wb1 >>iso163+nF1
3. virapt+lg[view] [source] [discussion] 2023-06-29 21:11:45
>>Myster+Yc
Because a ping does not behave consistently and will sometimes fail because of networking issues at the source. If you enable Pingdom checks for many endpoints and all available regions, prepare for some false positives every week, for example.

At that point it's worse than what you already know from your browser - it may show the service is having issues when you can access it, or that the service is ok when you can't.

replies(1): >>ben0x5+rG
4. naikro+Jg[view] [source] [discussion] 2023-06-29 21:13:12
>>Myster+Yc
Make a healthcheck ping every X seconds that never ever gives a false positive. Ever.

Try that and you'll understand why they update the pages manually.
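
For concreteness, here's a minimal sketch of the naive version of that idea: a single checker pinging one endpoint on a timer (the endpoint, interval, and timeout are made up). Any hiccup on the checker's own network path looks identical to the service being down, which is where the false positives come from.

    import time
    import urllib.request

    ENDPOINT = "https://example.com/health"   # hypothetical health endpoint
    INTERVAL_SECONDS = 30

    def check_once() -> bool:
        try:
            with urllib.request.urlopen(ENDPOINT, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            # A DNS hiccup or timeout *at the checker* lands here too,
            # indistinguishable from the service actually being down.
            return False

    while True:
        if not check_once():
            print("status page: DOWN")  # fires on every blip of the checker's own network
        time.sleep(INTERVAL_SECONDS)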

replies(1): >>Guilla+Gx1
5. sjsdai+tk[view] [source] [discussion] 2023-06-29 21:31:53
>>Myster+Yc
Quoting a prior comment of mine from a similar discussion in the past...

Stage 1: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Problems: Delayed or missed updates. Customers complain that you're not being honest about outages.

Stage 2: Status is automatically set based on the outcome of some monitoring check or functional test.

Problems: Any issue with the system that performs the "up or not?" source of truth test can result in a status change regardless of whether an actual problem exists. "Override automatic status updates" becomes one of the first steps performed during incident response, turning this into "status is manually set, but with extra steps". Customers complain that you're not being honest about outages and latency still sucks.

Stage 3: Status is automatically set based on a consensus of results from tests run from multiple points scattered across the public internet.

Problems: You now have a network of remote nodes to maintain yourself or pay someone else to maintain. The more reliable you want this monitoring to be, the more you need to spend. The cost justification discussions in an enterprise get harder as that cost rises. Meanwhile, many customers continue to say you're not being honest because they can't tell the difference between a local issue and an actual outage. Some customers might notice better alignment between the status page and their experience, but they're content, so they have little motivation to reach out and thank you for the honesty.

Eventually, the monitoring service gets axed because we can just manually update the status page after all.

Stage 4: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Not saying this is a great outcome, but it is an outcome that is understandable given the parameters of the situation.
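
For illustration, a minimal sketch of the Stage 3 consensus idea, assuming a handful of probe locations and a quorum threshold (the region names and the 60% cutoff are invented, not anyone's real setup):

    from typing import Dict

    def consensus_status(results: Dict[str, bool], quorum: float = 0.6) -> str:
        """results maps probe location -> True if that location's check passed."""
        if not results:
            return "unknown"
        failures = sum(1 for ok in results.values() if not ok)
        return "down" if failures / len(results) >= quorum else "up"

    print(consensus_status({"us-east": True, "eu-west": False, "ap-south": True}))   # up
    print(consensus_status({"us-east": False, "eu-west": False, "ap-south": True}))  # down

Every knob in there (how many probes, where they run, what counts as a quorum) is exactly the maintenance and cost burden described above.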

replies(2): >>ben0x5+nF >>dclowd+A21
6. nijave+0C[view] [source] [discussion] 2023-06-29 23:11:56
>>Myster+Yc
You quickly start to get into "what does down mean?" conversations. When you have a bunch of geographical locations and thousands of different systems/functionalities, it's not always clear if something is down.

Take a service that responds with errors 1% of the time. Probably not "down". What about 10%? Probably not. What about 50%? Maybe; hard to say.

Maybe there's a fiber cut in a rural village affecting 100% of your customers there but only 0.0001% of total customers?

Sure, there are cases like this where everything is hosed, but it sort of raises the question: is building a complex monitoring system for <some small number of downtimes a year> actually worth it?
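
One way to see why this gets murky is to try writing the thresholds down explicitly. The cutoffs below just echo the 1% / 10% / 50% examples above and are entirely arbitrary, which is the point:

    def classify(error_rate: float, affected_customer_share: float) -> str:
        if error_rate >= 0.5 or affected_customer_share >= 0.5:
            return "major outage"
        if error_rate >= 0.1 or affected_customer_share >= 0.05:
            return "degraded"
        return "operational"

    print(classify(error_rate=0.01, affected_customer_share=0.0))      # operational
    print(classify(error_rate=0.50, affected_customer_share=0.0))      # major outage
    # Fiber cut in one village: 100% of customers there, ~0.0001% of the total.
    print(classify(error_rate=0.0, affected_customer_share=0.000001))  # operational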

7. terom+lE[view] [source] [discussion] 2023-06-29 23:29:52
>>Myster+Yc
It's more a question of which of the (tens of) thousands of healthcheck pings that GitHub has undoubtedly implemented across their infrastructure should be used to drive the status page.
8. ben0x5+nF[view] [source] [discussion] 2023-06-29 23:36:53
>>sjsdai+tk
I think as an external user I'd be happiest if they just provided multiple indicators on the status page? Like,

    Internal metrics: Healthy
    External status check: Healthy
    Did ops announce an incident: No
    Backend API latency: )`'-.,_)`'-.,_)`'-.,_)`'-.,_)`'-.,_)`'-.,_)`'-.,_)`'-.,_
And when there's disagreement between indicators I can draw my own conclusions.
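
That could be as simple as publishing the raw signals side by side instead of collapsing them into one light; a rough sketch with invented field names:

    import json

    indicators = {
        "internal_metrics": "healthy",
        "external_status_check": "healthy",
        "ops_declared_incident": False,
        "backend_api_latency_p99_ms": [120, 135, 118, 900, 142],  # recent samples
    }

    # No aggregation: serialise each signal as-is and let the reader
    # resolve any disagreement between them.
    print(json.dumps(indicators, indent=2))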

I guess in reality the very existence of a status page is a tenuous compromise between engineers wanting to be helpful towards external engineers, and business interests who would prefer to sweep things under various rugs as much as possible ("what's the point of a website whose entire point is to tell the world that we're currently fucking up?").

replies(1): >>troupo+Dq1
9. ben0x5+rG[view] [source] [discussion] 2023-06-29 23:42:39
>>virapt+lg
> At that point it's worse than what you already know from your browser - it may show the service is having issues when you can access it, or that the service is ok when you can't.

Worst case you have more data points to draw conclusions from. Status page red, works for me? Hmm, maybe that's why the engineers in the other office are goofing off on Slack. Status page green, I get HTTP 500s? Guess I can't do this thing but maybe other parts of the app still work?

replies(1): >>virapt+Ui2
10. dclowd+A21[view] [source] [discussion] 2023-06-30 02:18:56
>>sjsdai+tk
Many of us create incidents and page people in the middle of the night when there’s an issue. I assume there’s a built-in filter there to ensure people are only paged when there’s actually something bad going on. Seems like a pretty reasonable place to hook in a public status change.
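
As a rough sketch of that hook, assuming hypothetical pager and status-page clients (none of these calls are a real API):

    def on_incident_created(incident: dict, pager, status_page) -> None:
        # Paging already happens today; the status update just piggybacks
        # on the same "is this actually bad?" severity filter.
        pager.page_on_call(incident["summary"])
        if incident["severity"] in ("sev1", "sev2"):
            status_page.set_component_status(
                component=incident["component"],
                status="degraded",
                note=incident["summary"],
            )
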
11. camden+Wb1[view] [source] [discussion] 2023-06-30 03:54:51
>>Myster+Yc
Making an official status change in a large organization can be kind of a big deal. Sometimes phrasing needs to be run by legal or customer-facing people. There can be contract implications.

Of course they should try to update their status page in a timely manner, but it is frequently manual from what I’ve seen.

12. troupo+Dq1[view] [source] [discussion] 2023-06-30 06:27:08
>>ben0x5+nF
> I think as an external user I'd be happiest if they just provided multiple indicators on the status page

This is equivalent to step 3 :)

replies(1): >>ben0x5+Fs3
13. Guilla+Gx1[view] [source] [discussion] 2023-06-30 07:33:39
>>naikro+Jg
False positives are not that important; false negatives are more annoying, for the users at least...
14. iso163+nF1[view] [source] [discussion] 2023-06-30 08:49:45
>>Myster+Yc
github.com was loading fine for me from a dozen+ locations. It seemed like a problem localised to a small part of the internet.
15. virapt+Ui2[view] [source] [discussion] 2023-06-30 13:35:11
>>ben0x5+rG
So essentially in neither situation did you get any information that changes what you'd do next. If something fails, you'll probably try working on another part and see if that works anyway. The automated status provided you with no extra actionable info.
16. ben0x5+Fs3[view] [source] [discussion] 2023-06-30 17:39:31
>>troupo+Dq1
Ah, I read step 3 as "a bunch of data gets condensed into one public indicator" rather than "a bunch of data gets published".