Hopefully it’s simply M5 didn’t have a server ready and they’ll migrate back.
Vultr has a great assortment of bare metal servers.
Our primary server died around 11pm last night (PST), so we switched to our secondary server, but then our secondary server died around 6am, and we didn't have a third.
The plan was always "in the unlikely event that both servers die at the same time, be able to spin HN up on AWS." We knew it would take us several hours to do that, but it seemed an ok tradeoff given how unlikely the both-servers-die-at-the-same-time scenario seemed at the time. (It doesn't seem so unlikely now. In fact it seems to have a probability of 1.)
Given what we knew when we made that plan, I'm pretty pleased with how things have turned out so far (fingers crossed—no jinx—definitely not gloating). We had done dry runs of this and made good-enough notes. It sucks to have been down for 8 hours, but it could have been worse, and without good backups (thank you sctb!) it would have been catastrophic.
Having someone as good as mthurman do most of the work is also a really good idea.
The disks were in two physically separate servers that were not connected to each other. I believe, however, that they were of similar make and model. So the leading hypothesis seems to be that perhaps the SSDs were from the same manufacturing batch and shared some defect. In other words, our servers were inbred! Which makes me want to link to the song 'Second Cousin' by Flamin' Groovies.
The HN hindsight consensus, to judge by the replies to https://news.ycombinator.com/item?id=32026606, is that this happens all the time, is not surprising at all, and is actually quite to be expected. Live and learn!
This is a known issue in NAS systems, and Freenas always recommended running two raid arrays with 3 disks in each array for mission critical equipment. By doing so, you can lose a disk in each array and keep on trucking without any glitches. Then if you happen to kill another disk during restriping, it would failover to the second mirrored array.
You could hotswap any failed disks in this setup without any downtime. The likelihood of losing 3 drives together in a server would be highly unlikely.