Tell HN: HN was down
1. dang+zk 2025-12-17 18:09:25
>>uyzstv+(OP)
Yes, sorry! We're investigating, but my current theory is we got overloaded because I relaxed some of our anti-crawler protections a few days ago.

(The reason I did that is that the anti-crawler protections also unfortunately hit some legit users, and we don't want to block legit users. However, it seems that I turned the knobs down too far.)

In this case, though, we had a secondary failure: PagerDuty woke me up at 5:24am, I checked HN and it seemed fine, so I told PagerDuty the problem was resolved. But the problem wasn't resolved - at that point I was just sleeping through it.

I'll add more as we find out more, but it probably won't be till later this afternoon PST.

Edit: later than I expected, but for those still following, the main things I've learned are (1) pkill wasn't able to kill SBCL this time - we have a script that does that when HN stops responding, but it didn't work, so we'll revise the script; and (2) how to get PagerDuty not to let you go back to sleep if your site is actually still down.
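
A minimal sketch of one way a restart script might escalate when a polite kill is ignored. The thread does not say why pkill failed or what the HN script looks like, so this is an illustration under stated assumptions, not the actual fix: it assumes a POSIX system and a supervisor that launched the server itself (so it holds a `subprocess.Popen` handle), and the name `terminate_with_escalation` is hypothetical. For context, `pkill` sends SIGTERM by default, which a process can ignore or fail to act on; SIGKILL cannot be caught or ignored.

```python
import signal
import subprocess


def terminate_with_escalation(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Ask a child process to exit; force-kill it if it does not comply in time.

    Hypothetical helper for illustration only.
    Returns "already-exited", "terminated", or "killed".
    """
    # Nothing to do if the process has already gone away.
    if proc.poll() is not None:
        return "already-exited"

    # Step 1: polite request (the same signal pkill sends by default).
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        # Step 2: the process ignored SIGTERM or is stuck, so escalate.
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed"
```

The return value lets the calling script log which path was taken, which may help show how often the polite signal is not enough.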

◧◩
2. shlomo+Rq 2025-12-17 18:38:06
>>dang+zk
Crazy that Dang literally manages HN in his sleep!

We all knew that but I haven't seen any confirmation before this.

◧◩◪
3. dang+vu 2025-12-17 18:53:02
>>shlomo+Rq
failing to manage HN in my sleep is more like it
◧◩◪◨
4. dijit+Ev 2025-12-17 18:58:15
>>dang+vu
We all have our moments, and I personally consider HN to be “best effort”, almost like a volunteer project. I’m not certain I’m correct, but that’s the impression I have, so my expectations are adjusted to that.

So don’t beat yourself up please.

When I worked for “SaaS unicorn” we typically had multiple levels of escalation, and acknowledging would have done nothing because the alarm would continue firing until fixed. Not sure what’s changed in 15 years of ops; I had assumed it would be better now. I can’t imagine silencing an alert totally by acknowledging it if it’s still occurring.

I’m totally fine with how you handled it, if anything I am thankful. But that seems to be a system I would improve if I had the time.

“mute” is different than “resolve” to me, and both should exist. (Where mute is an acknowledgement of an issue as ongoing.)

◧◩◪◨⬒
5. scottl+2F 2025-12-17 19:39:44
>>dijit+Ev
This. If it were a business-critical money fountain, I'd expect follow-the-sun SRE coverage. I don't think it is, so I can probably accept drinking my morning coffee without scrolling HN once in a while. There's only so much one can beat oneself up about a slow/incorrect response when the on-call is handled by what, just one person? Maybe two people in the same time zone?

(Might be wise though to have PagerDuty configured to re-alert if the outage persists.)
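
The "mute vs. resolve" distinction and the re-alert idea discussed above can be sketched as a small policy. This is not PagerDuty's API or configuration; it is a toy model, and every name in it (`Incident`, `acknowledge`, `try_resolve`, `should_page`, `mute_seconds`) is hypothetical. The assumptions are that an acknowledgement only mutes paging for a fixed window, and that a resolve request is refused while an independent health check still reports the site as down.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Toy incident record; times are seconds on any monotonic clock."""
    opened_at: float
    acked_at: Optional[float] = None
    resolved_at: Optional[float] = None


def acknowledge(incident: Incident, now: float) -> None:
    """Mute: record that a human has seen the alert. Does not close it."""
    incident.acked_at = now


def try_resolve(incident: Incident, site_healthy: bool, now: float) -> bool:
    """Resolve only if the health check agrees the site is back up."""
    if not site_healthy:
        return False
    incident.resolved_at = now
    return True


def should_page(incident: Incident, site_healthy: bool, now: float,
                mute_seconds: float = 600.0) -> bool:
    """Decide whether to page (again) for this incident."""
    if incident.resolved_at is not None or site_healthy:
        return False
    if incident.acked_at is None:
        return True
    # Acknowledged but still down: re-alert once the mute window expires.
    return now - incident.acked_at >= mute_seconds
```

Under this model, acknowledging an alert and going back to sleep buys at most `mute_seconds` of quiet while the outage persists. A single health check can still be fooled if the site looks fine at the moment it runs, so this only narrows the gap rather than closing it.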