Hopefully it’s simply that M5 didn’t have a server ready and they’ll migrate back.
Vultr has a great assortment of bare metal servers.
Our primary server died around 11pm last night (PST), so we switched to our secondary server, but then our secondary server died around 6am, and we didn't have a third.
The plan was always "in the unlikely event that both servers die at the same time, be able to spin HN up on AWS." We knew it would take us several hours to do that, but it seemed an ok tradeoff given how unlikely the both-servers-die-at-the-same-time scenario seemed at the time. (It doesn't seem so unlikely now. In fact it seems to have a probability of 1.)
Given what we knew when we made that plan, I'm pretty pleased with how things have turned out so far (fingers crossed—no jinx—definitely not gloating). We had done dry runs of this and made good-enough notes. It sucks to have been down for 8 hours, but it could have been worse, and without good backups (thank you sctb!) it would have been catastrophic.
Having someone as good as mthurman do most of the work is also a really good idea.
The disks were in two physically separate servers that were not connected to each other. I believe, however, that they were of similar make and model. So the leading hypothesis seems to be that perhaps the SSDs were from the same manufacturing batch and shared some defect. In other words, our servers were inbred! Which makes me want to link to the song 'Second Cousin' by Flamin' Groovies.
The HN hindsight consensus, to judge by the replies to https://news.ycombinator.com/item?id=32026606, is that this happens all the time, is not surprising at all, and is actually quite to be expected. Live and learn!
Really sorry that you had to learn the hard way, but this is unfortunately common knowledge :/ Way back (2004), when I was shadowing (and eventually replacing) a mentor who handled infrastructure for a major institution, he gave me a rule I took to heart from then on: Always diversify. Diversify across manufacturer, diversify across make/model, hell, if it's super important, diversify across _technology stacks_ if you can.
It was policy within our (infrastructure) group that /any/ new server or service must be buildable from at least 2 different sources of components before going live, and for mission-critical things, 3 is better. Anything "production" had to be multihomed if it connected to the internet.
Need to build a new storage server? Get a Supermicro board _and_ a Tyan (or buy an assortment of Dell & IBM), then populate both with an assortment of drives picked randomly across 3 manufacturers, with purchases spread out across time (we used 3 months) as well as across resellers. Any RAID array with more than 4 drives had to include a hot spare. For even more peace of mind, add a crappy desktop PC with a ton of huge external drives and periodically sync to that.
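If you want to audit an existing box for this kind of drive monoculture, a minimal sketch like the one below works (assumptions: SATA/ATA drives, smartctl installed and run with enough privileges, and a made-up device list; NVMe drives label their fields differently). It just groups array members by model and firmware revision and complains when several match, which would flag the same-batch situation described upthread:

  #!/usr/bin/env python3
  # Sketch: flag drives that share a model/firmware revision, based on
  # the identity section of `smartctl -i`. Device list is a placeholder.
  import subprocess
  from collections import Counter

  DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # assumed array members

  def identity(dev):
      """Return (model, firmware) parsed from `smartctl -i <dev>`."""
      out = subprocess.run(["smartctl", "-i", dev],
                           capture_output=True, text=True).stdout
      fields = {}
      for line in out.splitlines():
          if ":" in line:
              key, _, val = line.partition(":")
              fields[key.strip()] = val.strip()
      return (fields.get("Device Model", "unknown"),
              fields.get("Firmware Version", "unknown"))

  counts = Counter(identity(d) for d in DEVICES)
  for (model, fw), n in counts.items():
      if n > 1:
          print(f"WARNING: {n} drives share model {model} / firmware {fw}")

It won't tell you two drives came off the same production line on the same day, but model + firmware overlap is usually a good enough proxy to know you've bought yourself a correlated-failure risk.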
He also taught me that it's not done until you do a few live "disaster tests" (yanking drives out of fully powered-up servers during heavy IO; brutally ripping out power cables, quickly plugging them back in, then yanking them out again once you hear the machine doing something, then plugging back in...), without giving anyone advance notice. Then, and only then, is a service "done".
I thought "Wow, $MENTOR is really into overkill!!" at the time, but he was right.
I credit his "rules for building infrastructure" for the zero-loss track record of every piece of infra I've maintained, my whole life.