zlacker

[return to "Github.com is down"]
1. dmatti+L3 2023-06-29 17:52:35
>>AlphaW+(OP)
Putting your status page on a separate domain for availability reasons: good

Not updating that status page when the core domain goes down: less good

◧◩
2. troupo+Pu 2023-06-29 19:47:06
>>dmatti+L3
You'd be surprised how often those pages are updated manually, by the person on call who has other things to take care of first.
◧◩◪
3. Myster+NH 2023-06-29 20:53:20
>>troupo+Pu
Because a healthcheck ping every X seconds is too difficult to implement for a GitHub-sized company? There they have it now. Useless status page...
◧◩◪◨
4. sjsdai+iP 2023-06-29 21:31:53
>>Myster+NH
Quoting a prior comment of mine from a similar discussion in the past...

Stage 1: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Problems: Delayed or missed updates. Customers complain that you're not being honest about outages.

Stage 2: Status is automatically set based on the outcome of some monitoring check or functional test.

Problems: Any issue with the system that performs the "up or not?" source-of-truth test can result in a status change regardless of whether an actual problem exists. "Override automatic status updates" becomes one of the first steps performed during incident response, turning this into "status is manually set, but with extra steps". Customers complain that you're not being honest about outages, and latency still sucks.

Stage 3: Status is automatically set based on a consensus of results from tests run from multiple points scattered across the public internet.

Problems: You now have a network of remote nodes to maintain yourself or pay someone else to maintain. The more reliable you want this monitoring to be, the more you need to spend. The cost justification discussions in an enterprise get harder as that cost rises. Meanwhile, many customers continue to say you're not being honest because they can't tell the difference between a local issue and an actual outage. Some customers might notice better alignment between the status page and their experience, but they're content, so they have little motivation to reach out and thank you for the honesty.

Eventually, the monitoring service gets axed because we can just manually update the status page after all.

Stage 4: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Not saying this is a great outcome, but it is an outcome that is understandable given the parameters of the situation.
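
For illustration, Stages 2 and 3 above can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not anyone's actual status-page logic: the names (Status, consensus_status, published_status) and the 50% quorum are hypothetical, and the per-node probe results are assumed to be collected somewhere else.

```python
from enum import Enum
from typing import Optional, Sequence


class Status(Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    OUTAGE = "outage"


def consensus_status(probe_ok: Sequence[bool], outage_quorum: float = 0.5) -> Status:
    """Stage 3: derive a status from many vantage points instead of one check.

    probe_ok holds one boolean per remote node (True = check passed).
    outage_quorum is a hypothetical threshold: an outage is declared only
    when more than this fraction of nodes report a failure.
    """
    if not probe_ok:
        raise ValueError("need at least one probe result")
    failed = sum(1 for ok in probe_ok if not ok)
    if failed == 0:
        return Status.OPERATIONAL
    if failed / len(probe_ok) > outage_quorum:
        return Status.OUTAGE
    # A minority of failing nodes may be a local network issue,
    # so report something weaker than a full outage.
    return Status.DEGRADED


def published_status(
    probe_ok: Sequence[bool], manual_override: Optional[Status] = None
) -> Status:
    """The incident-response override from Stage 2 wins over automation."""
    if manual_override is not None:
        return manual_override
    return consensus_status(probe_ok)
```

With a single-element probe_ok this collapses to Stage 2, where one broken checker flips the page on its own. And once manual_override is set during every incident, the automation is effectively bypassed, which is the "manually set, but with extra steps" problem.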
