zlacker

[return to "Tell HN: HN Moved from M5 to AWS"]
1. albert+5a[view] [source] 2022-07-09 02:50:49
>>1vuio0+(OP)
Why the move?

Hopefully it’s simply that M5 didn’t have a server ready and they’ll migrate back.

Vultr has a great assortment of bare metal servers.

https://www.vultr.com/products/bare-metal/#pricing

2. dang+ec[view] [source] 2022-07-09 03:08:17
>>albert+5a
> Why the move?

Our primary server died around 11pm last night (PST), so we switched to our secondary server, but then our secondary server died around 6am, and we didn't have a third.

The plan was always "in the unlikely event that both servers die at the same time, be able to spin HN up on AWS." We knew it would take us several hours to do that, but it seemed an ok tradeoff given how unlikely the both-servers-die-at-the-same-time scenario seemed at the time. (It doesn't seem so unlikely now. In fact it seems to have a probability of 1.)

Given what we knew when we made that plan, I'm pretty pleased with how things have turned out so far (fingers crossed—no jinx—definitely not gloating). We had done dry runs of this and made good-enough notes. It sucks to have been down for 8 hours, but it could have been worse, and without good backups (thank you sctb!) it would have been catastrophic.

Having someone as good as mthurman do most of the work is also a really good idea.

3. omegal+Xc[view] [source] 2022-07-09 03:15:02
>>dang+ec
Do you have a postmortem on why both servers died so fast?
4. dang+xd[view] [source] 2022-07-09 03:19:59
>>omegal+Xc
It was an SSD that failed in each case, and in a similar way (e.g. both were in RAID arrays but neither could be rebuilt from the array - but I am over my skis in reporting this, as I barely know what that means).

The disks were in two physically separate servers that were not connected to each other. I believe, however, that they were of similar make and model. So the leading hypothesis seems to be that perhaps the SSDs were from the same manufacturing batch and shared some defect. In other words, our servers were inbred! Which makes me want to link to the song 'Second Cousin' by Flamin' Groovies.

The HN hindsight consensus, to judge by the replies to https://news.ycombinator.com/item?id=32026606, is that this happens all the time, is not surprising at all, and is actually quite to be expected. Live and learn!

5. hoofhe+4i[view] [source] 2022-07-09 04:04:22
>>dang+xd
I believe a more plausible scenario is that each drive failed during the RAID rebuild and restriping process.

This is a known issue in NAS systems, and FreeNAS always recommended running two RAID arrays with 3 disks in each array for mission-critical equipment. By doing so, you can lose a disk in each array and keep on trucking without any glitches. Then if you happen to kill another disk during restriping, it would fail over to the second mirrored array.

You could hot-swap any failed disks in this setup without any downtime, and losing 3 drives together in one server would be highly unlikely.

https://www.45drives.com/community/articles/RAID-and-RAIDZ/
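The survivability claim above can be checked by brute force. Here is an illustrative Python enumeration (not FreeNAS tooling) assuming the layout described in the comment: two mirrored single-parity arrays of 3 disks each, where data survives as long as at least one array still has no more than one failed disk.

```python
from itertools import combinations

DISKS_PER_ARRAY = 3  # each single-parity (RAIDZ1-style) array tolerates 1 failed disk
ARRAYS = 2           # the two arrays mirror each other

def array_survives(failed_in_array: int) -> bool:
    # A 3-disk single-parity array survives at most one disk failure.
    return failed_in_array <= 1

def pool_survives(failed_disks: set) -> bool:
    # Data survives if at least one of the mirrored arrays is still intact.
    per_array = [sum(1 for d in failed_disks if d // DISKS_PER_ARRAY == a)
                 for a in range(ARRAYS)]
    return any(array_survives(f) for f in per_array)

# Enumerate every way of losing k of the 6 disks.
for k in range(7):
    outcomes = [pool_survives(set(c)) for c in combinations(range(6), k)]
    print(f"{k} disks lost: {sum(outcomes)}/{len(outcomes)} combinations survive")
```

The enumeration shows that every possible combination of 3 failed disks is survivable in this layout; data loss first becomes possible at 4 concurrent failures (when each array has lost 2 disks).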

6. pmoria+MU[view] [source] 2022-07-09 11:15:29
>>hoofhe+4i
Ideally, there should be redundancy in servers, too: different hardware, on different sides of the planet, with different service providers.