(The reason I did that is that the anti-crawler protections also unfortunately hit some legit users, and we don't want to block legit users. However, it seems that I turned the knobs down too far.)
In this case, though, we had a secondary failure: PagerDuty woke me up at 5:24am, I checked HN and it seemed fine, so I told PagerDuty the problem was resolved. But the problem wasn't resolved; at that point I was just sleeping through it.
I'll add more as we find out more, but it probably won't be till later this afternoon PST.
Edit: later than I expected, but for those still following, the main things I've learned are (1) pkill wasn't able to kill SBCL this time. We have a script that kills SBCL when HN stops responding, but it didn't work, so we'll revise the script; and (2) how to set up PagerDuty so it doesn't let you go back to sleep if your site is actually still down.
We all knew that, but I hadn't seen any confirmation before this.
I think you're confusing popularity with criticality. I'm sure everyone in here can withstand a few hours without browsing the page.
It's dang's baby at this point, and this is a good thing, as long as HN doesn't affect his life in ways he doesn't want.
However, when something I care about crashes and burns once in a blue moon, I make sure to put the fire out, at least enough to keep it running until regular hours. Things I care about can be both business and personal, and nobody bugs me about them.
Maybe we shouldn't make any assumptions about people we don't personally know, while we are at it.
You are free to choose what you do with your personal life.
Meanwhile, it is pretty obvious that it's pointless to demand or expect personal sacrifice to maintain unrealistic levels of high availability in services that are far from critical. I mean, do you honestly believe that the messages you and I are writing are so important to get out that someone must sacrifice their personal time to ensure they are served to the world this very instant instead of, say, 3 or 6 or 13 hours later? Absurd.