zlacker

I was surprised they only had one backup server, especially given the competitive price of rackmount hardware these days. More replicas needed.

Although this was a fun exercise to learn how lost I feel without HN. Damn.

replies(2): >>dang+12 >>andrea+Fk

>>metada+(OP)
Our thinking was: (1) keep a hot standby to fail over to when we need it. That keeps downtime to seconds in routine cases (like pre-planned maintenance) and minutes or an hour in most failure cases; for example, when our primary server died last night, HN was down for about an hour while we brought up the standby. And (2) in the unlikely event that both the primary and standby servers fail at the same time, be able to bring up a fresh server from backup within hours, not days. The latter case is what happened today, and in the end we were down for just under 8 hours. (Assuming we don't sink back into the pit of hell overnight.)

Assuming things don't fail again in the next day or two, since we still have a lot to take care of (fingers crossed—definitely not gloating), I feel like this was pretty reasonable. We don't have a lot of dev or ops resources—few people work on HN, and only me full-time these days. The more complex one's replica architecture, the higher the maintenance costs. The simplicity of our setup has served us well in the 9 years that we've been running it, and I feel like the tradeoff of "several hours downtime once a decade" is worth it if you draw one of those risk/cost managerial whiteboard things.
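
To put "several hours downtime once a decade" in numbers, here is a rough back-of-the-envelope sketch. It assumes a single outage of about 8 hours over a 10-year window and ignores shorter interruptions such as the roughly one-hour failover; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year length, including leap years


def availability(downtime_hours: float, years: float) -> float:
    """Fraction of time the site was up, given total downtime over a period."""
    return 1 - downtime_hours / (years * HOURS_PER_YEAR)


# Assumed inputs: one 8-hour outage in a 10-year window.
print(f"{availability(8, 10):.4%}")  # prints 99.9909%
```

Under those assumptions the setup lands at roughly "four nines", which is the kind of figure the risk/cost whiteboard would weigh against the maintenance cost of extra replicas.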

replies(3): >>metada+ec >>toast0+5l >>tannha+wp

>>dang+12
@Dang, the state of affairs around here is already more than reasonable. In fact, it's incredible HN is almost never down, and the occasional 8-24 hour interruption every 5-7 years is actually a Good Thing for HN whores (like me) to reflect on how insanely much we are hooked on and love this stupid time sink technomancer site.

Cheers, you are the true and literal soul of the machine embodying the best spirit of the oftentimes beautiful thing that is Post-Paul-Graham HackerNews.

Please just promise to never die.

>>metada+(OP)
One backup server is apparently sufficient, given the primary held up for 4.5 years. The issue was correlation between the primary and the spare, which wouldn't have been solved by more replicas anyway.

>>dang+12
It might be worth considering a way to get a "we're working on it" notice up quickly (HN status on Twitter worked, but it's kind of nicer when something loads at the main address). Still, an 8-hour outage once a decade for something that's not really critical is pretty good; no need to increase complexity, although try to get some storage diversity for the future, now that you've learned about that.
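
A notice at the main address can be very small. The following is only a sketch using Python's standard library, not how HN does or would do it; the notice text, the one-hour `Retry-After` value, and the `make_status_server` helper are all assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical notice text; adjust to taste.
NOTICE = "We're having some trouble and are working on it. Please check back soon.\n"


class StatusHandler(BaseHTTPRequestHandler):
    """Answer every GET with 503 and a plain-text notice."""

    def do_GET(self):
        body = NOTICE.encode("utf-8")
        self.send_response(503)  # Service Unavailable
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Retry-After", "3600")  # hint for clients and crawlers
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real deployment would log requests


def make_status_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Build (but do not start) the notice server."""
    return HTTPServer((host, port), StatusHandler)
```

Calling `make_status_server().serve_forever()` on any spare machine and pointing the main address at it would be enough to show the notice. Returning 503 rather than 200 tells well-behaved crawlers and caches that the page is temporary.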

>>dang+12
I'm guessing that several-hours-outage figure from restoring a full backup could be reduced by restoring the most up-to-date discussions first, then gradually restoring older discussions, then re-indexing for full-text search, all the while running in degraded mode but still having a front page. But tbh I'm just glad it wasn't an attack against free speech in tough times.
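
The ordering I have in mind could look roughly like this. It is a sketch only: the `Thread` record, the two-day "hot" window, and the phase names are invented for illustration and say nothing about how HN's data or backups are actually laid out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Thread:
    thread_id: int
    last_activity: datetime


def plan_restore(threads, now, hot_window=timedelta(days=2)):
    """Split threads into (hot, cold), each sorted newest first."""
    ordered = sorted(threads, key=lambda t: t.last_activity, reverse=True)
    hot = [t for t in ordered if now - t.last_activity <= hot_window]
    cold = [t for t in ordered if now - t.last_activity > hot_window]
    return hot, cold


def restore_phases(threads, now):
    """Yield (phase_name, threads) in the order the work would be done."""
    hot, cold = plan_restore(threads, now)
    yield "serve_front_page", hot            # degraded mode: recent threads only
    yield "backfill_archive", cold           # older discussions, newest first
    yield "rebuild_search_index", hot + cold  # full-text search comes last
```

The site could start serving as soon as the first phase finishes, with the archive and search filling in behind it.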