zlacker

Why the move?

Hopefully it’s simply M5 didn’t have a server ready and they’ll migrate back.

Vultr has a great assortment of bare metal servers.

https://www.vultr.com/products/bare-metal/#pricing

>>albert+(OP)
There is also Hetzner, their dedicated server pricing is very good.

>>albert+(OP)
> Why the move?

Our primary server died around 11pm last night (PST), so we switched to our secondary server, but then our secondary server died around 6am, and we didn't have a third.

The plan was always "in the unlikely event that both servers die at the same time, be able to spin HN up on AWS." We knew it would take us several hours to do that, but it seemed an ok tradeoff given how unlikely the both-servers-die-at-the-same-time scenario seemed at the time. (It doesn't seem so unlikely now. In fact it seems to have a probability of 1.)

Given what we knew when we made that plan, I'm pretty pleased with how things have turned out so far (fingers crossed—no jinx—definitely not gloating). We had done dry runs of this and made good-enough notes. It sucks to have been down for 8 hours, but it could have been worse, and without good backups (thank you sctb!) it would have been catastrophic.

Having someone as good as mthurman do most of the work is also a really good idea.

replies(8): >>omegal+S2 >>albert+q3 >>nemoth+y5 >>rstupe+w6 >>mwcamp+lj >>aditya+il >>O_____+wC >>smn123+Ud1

>>jacoop+w1
But they don't have servers in US do they?

replies(1): >>Amfy+z4

>>dang+92
Do you have a postmortem on why both servers died so fast?

replies(1): >>dang+s3

>>dang+92
Speaking for everyone, really appreciate you dang.

Question: so will HN be migrating back to M5 (or another hosting provider).

replies(1): >>dang+E5

>>omegal+S2
It was an SSD that failed in each case, and in a similar way (e.g. both were in RAID arrays but neither could be rebuilt from the array - but I am over my skis in reporting this, as I barely know what that means).

The disks were in two physically separate servers that were not connected to each other. I believe, however, that they were of similar make and model. So the leading hypothesis seems to be that perhaps the SSDs were from the same manufacturing batch and shared some defect. In other words, our servers were inbred! Which makes me want to link to the song 'Second Cousin' by Flamin' Groovies.

The HN hindsight consensus, to judge by the replies to https://news.ycombinator.com/item?id=32026606, is that this happens all the time, is not surprising at all, and is actually quite to be expected. Live and learn!

replies(4): >>whitep+N5 >>hoofhe+Z7 >>loxias+aa >>ksec+191

>>jacoop+w1
they are not very good for hosting user generated content, they suspend very fast for any sort of Abuse complaint. User generated content, will result in some Abuse complaints.

replies(1): >>jacoop+NJ

>>ta988+j2
they do, in Ashburn. I would not host production with them.

replies(2): >>whitep+R5 >>closep+kk

>>dang+92
>The plan was always "in the unlikely event that both servers die at the same time, be able to spin HN up on AWS.

>We had done dry runs of this in the past,

Incredible. Actual disaster recovery.

>>albert+q3
We've only had 5 minutes to talk about it so far, but unless something changes, I don't see why we wouldn't go back to M5.

replies(1): >>booi+oq

>>dang+s3
Do you happen to know the make and model of SSD?

replies(1): >>wolfga+J7

>>Amfy+z4
Why?

replies(1): >>ev1+Uo

>>dang+92
If I can ask what is an sctb?

replies(2): >>wolfga+77 >>quink+p7

>>rstupe+w6
Scott Bell is a former[1] HN moderator: https://news.ycombinator.com/user?id=sctb

[1]: https://news.ycombinator.com/item?id=25055115

>>rstupe+w6
Scott Bell, user sctb.

>>whitep+N5
Just posted to the linked thread:

kabdib> Let me narrow my guess: They hit 4 years, 206 days and 16 hours . . . or 40,000 hours. And that they were sold by HP or Dell, and manufactured by SanDisk.

mikiem> These were made by SanDisk (SanDisk Optimus Lightning II) and the number of hours is between 39,984 and 40,032...

replies(1): >>dang+5c

>>dang+s3
I believe a more plausible scenario could be that each drive failed during the RAID rebuild and restriping process.

This is a known issue in NAS systems, and Freenas always recommended running two raid arrays with 3 disks in each array for mission critical equipment. By doing so, you can lose a disk in each array and keep on trucking without any glitches. Then if you happen to kill another disk during restriping, it would failover to the second mirrored array.

You could hotswap any failed disks in this setup without any downtime. The likelihood of losing 3 drives together in a server would be highly unlikely.

https://www.45drives.com/community/articles/RAID-and-RAIDZ/

replies(2): >>pmoria+HK >>aaaaaa+0X

>>dang+s3
> So the leading hypothesis seems to be that perhaps the SSDs were from the same manufacturing batch and shared some defect.

Really sorry that you had to learn the hard way, but this is unfortunately common knowledge :/ Way back (2004) when I was shadowing-eventually-replacing a mentor that handled infrastructure for a major institution, he gave me a rule I took to heart from then forward: Always diversify. Diversify across manufacturer, diversify across make/model, hell, if it's super important, diversify across _technology stacks_ if you can.

It was policy within our (infrastructure) group that /any/ new server or service must be build-able from at least 2 different sources of components before going live, and for mission critical things, 3 is better. Anything "production" had to be multihomed if it connects to the internet.

Need to build a new storage server service? Get a Supermicro board _and_ a Tyan (or buy an assortment of Dell & IBM), then populate both with an assortment of drives picked randomly across 3 manufacturers, with purchases spread out across time (we used 3months) as well as resellers. Any RAID array with more than 4 drives had to include a hot spare. For even more peace of mind, add a crappy desktop PC with a ton of huge external drives and periodically sync to that.

He also taught me that it's not done until you do a few live "disaster tests" (yanking drives out of fully powered up servers, during heavy IO. Brutally ripping power cables out, quickly plugging it back in, then yanking it out again once you hear the machine doing something, then plug back in...), without giving anyone advance notice. Then, and only then, is a service "done".

I thought "Wow, $MENTOR is really into overkill!!" at the time, but he was right.

I credit his "rules for building infrastructure" for having a zero loss track record when it comes to infra I maintain, my whole life.

replies(3): >>erik_s+Rc >>dang+Zf >>zimpen+Lh

>>wolfga+J7
https://news.ycombinator.com/item?id=32028511

>>loxias+aa
Didn’t Intel grant AMD some kind of license because the US government refused to buy x86 CPU models that only have one source?

replies(1): >>trasz+r96

>>loxias+aa
> this is unfortunately common knowledge

This reminds me of Voltaire: "Common sense is not so common."

Thanks for the great comment—everything you say makes perfect sense and is even obvious in hindsight, but it's the kind of thing that tends to be known by grizzled infrastructure veterans who had good mentors in their chequered past—and not so much by the rest of us.

I fear getting karmically smacked for repeating this too often, but the more I think about it, the more I feel like 8 hours of downtime is not an unreasonable price to pay for this lesson. The opportunity cost of learning it beforehand would have been high as well.

replies(2): >>loxias+Xm >>pmoria+nK

>>loxias+aa
> Way back (2004) [...] he gave me a rule [...]: Always diversify.

Annoyingly, in 2000-4, I was trying to get people to understand this and failing constantly because "it makes more sense if everything is the same - less to learn!" Hilariously*, I also got the blame when things broke even though none of them were my choice or design.

(Hell, even in 2020, I hit a similar issue with a single line Ruby CLI - lots of "everything else uses Python, why is it not Python?" moaning. Because the Python was a lot faffier and less readable!)

edit: to fix the formatting

>>dang+92
If any of you have time to answer this, I'm curious about how you do backups. Are you using frequent rsyncs (or something functionally equivalent like BorgBackup), ZFS snapshots, or something custom? It looks like you must have had frequent backups, since when HN finally came back, it was quite close to the state that I remember it being in when it went down.

replies(1): >>dang+Pk

>>Amfy+z4
That’s a VPS offering only, they don’t do dedicated servers here.

>>mwcamp+lj
Custom, I guess (edit: that was an overstatement—let's say partly custom). We upload a snapshot every 24 hours to S3 and incremental updates every few kb (I think).

We use rsync for log files.

>>dang+92
I am surprised the forum rich of tech people is not yet containerised. A simple helm chart would make deployment to AWS (or any other kubernetes) a 5 min job.

replies(2): >>kijin+Ym >>fulafe+9w

>>dang+Zf
> it's the kind of thing that tends to be known by grizzled infrastructure veterans who had good mentors in their chequered past

And thanks right back at you.

I hadn't noticed before your comment that while not in the customary way (I'm brown skinned and was born into a working class family) I've got TONS of "privilege" in other areas. :D

My life would probably be quite different if I didn't have active Debian and Linux kernel developers just randomly be the older friends helping me in my metaphorical "first steps" with Linux.

Looking back 20+ years ago, I lucked into an absurdly higher than average "floor" when I started getting serious about "computery stuff". Thanks for that. That's some genuine "life perspective" gift you just gave me. I'm smiling. :) I guess it really is hard to see your own privilege.

> 8 hours of downtime is not an unreasonable price to pay for this lesson. The opportunity cost of learning it beforehand would have been high as well.

100% agree.

I'd even say the opportunity cost would have been much higher. Additionally, 8hrs of downtime is still a great "score", depending on the size of the HN organization. (bad 'score' if it's >100 people. amazing 'score' if it's 1-5 people.)

>>aditya+il
I'd assume it takes more than 5 minutes to bring up a complete copy of the HN database on a different platform.

Deploying source code is trivial these days. Large databases, not so much, unless you're already using something like RDS.

replies(1): >>Kronis+nw

>>whitep+R5
I host a lot of production there, but always some layers underneath - like a high CPU or backing storage layer or database, worker nodes that pick up tasks. Never the front user facing.

Do not let any user generated content be accessible from any hetzner IP or you are one email away from shutdown pretty much. Don't forget germany's laws on speech too; they are nothing remotely similar to US. I would host, for example, a corporate site just fine, but last thing ever would be a forum or image hosting site or w/e

replies(1): >>Amfy+pL

>>dang+E5
Well having a double server failure would definitely make me think about going back to the same solution..

replies(1): >>hdjjhh+SB

>>aditya+il
Kubernetes is for making said tech people rich, not for running the 90s style forum web app that said tech people like to use.

>>kijin+Ym
> I'd assume it takes more than 5 minutes to bring up a complete copy of the HN database on a different platform.

Hmm, that actually makes me wonder about how big it would actually be. The nature of HN (not really storing a lot of images/videos like Reddit, for example) would probably lend itself well to being pretty economical in regards to the space used.

Assuming a link of 1 Gbps, ideally you'd be able to transfer close to 125 MB/s. So that'd mean that in 5 minutes you could transfer around 37'500 MB of data to another place, though you have to account for overhead. With compression in place, you might just be able to make this figure a lot better, though that depends on how you do things.

In practice the link speeds will vary (a lot) based on what hardware/hosting you're using, where and how you store any backups and what you use for transferring them elsewhere, if you can do that stuff incrementally then it's even better (scheduled backups of full data, incremental updates afterwards).

Regardless, in an ideal world where you have a lot of information, this would boil down to a mathematical equation, letting you plot how long bringing over all of the data would take for any given DB size (for your current infrastructure/setup). For many systems out there, 5 minutes would indeed be possible - but that becomes less likely the more data you store, or the more complicated components you introduce (e.g. separate storage for binary data, multiple services, message queues with persistence etc.).

That said, in regards to the whole container argument: I think that there are definitely benefits to be had from containerization, as long as you pick a suitable orchestrator (Kubernetes if you know it well from working in a lab setting or with supervision under someone else in a prod setting, or something simpler like Nomad/Swarm that you can prototype things quickly with).

replies(2): >>kijin+JI >>pmoria+yJ

>>booi+oq
I'd say the opposite: having experienced this, they would make sure to significantly reduce the probability of it happening again.

In the past, I had a similar problem because of using hardware from the same batch. In retrospect, it's silly to be surprised they died at the same time.

>>dang+92
Thanks mthurman!!

_____

https://www.linkedin.com/in/markethurman

https://news.ycombinator.com/user?id=mthurman

replies(1): >>O_____+wW

>>Kronis+nw
Network transfer is only a small part of the equation.

You can't just rsync files into a fully managed RDS PostgreSQL or Elasticsearch instance. You'll probably need to do a dump and restore, especially if the source machine has bad disks and/or has been running a different version. This will take much longer than simply copying the files.

Of course you could install the database of your choice in an EC2 box and rsync all you want, but that kinda defeats the purpose of using AWS and containerizing in the first place.

replies(1): >>Kronis+ET

>>Kronis+nw
> Assuming a link of 1 Gbps, ideally you'd be able to transfer close to 125 MB/s. So that'd mean that in 5 minutes you could transfer around 37'500 MB of data to another place, though you have to account for overhead. With compression in place, you might just be able to make this figure a lot better, though that depends on how you do things.

Ideally, all this data would have been already backed up to AWS (or your provider of choice) by the time your primary service failed, so all your have to do is spin up your backup server and your data would be waiting for you.

(Looks like HN does just this: https://news.ycombinator.com/item?id=32032316 )

replies(1): >>Kronis+9U

>>Amfy+w4
I host a forum in it and there is nor problems. Of course a site as big as HN would've different rules and treatment.

>>dang+Zf
> it's the kind of thing that tends to be known by grizzled infrastructure veterans who had good mentors in their chequered past—and not so much by the rest of us

This is why your systems should be designed by grizzled infrastructure veterans.

replies(1): >>dang+yf2

>>hoofhe+Z7
Ideally, there should be redundancy in servers, too.. With different hardware, on different sides of the planet, on different service providers.

replies(1): >>hoofhe+ES

>>ev1+Uo
this. matches my experience and the ones in my circle

>>pmoria+HK
Correct.. Your production server with dual mirrored arrays should have an identical warm spare. If it's in the same data center, then you need a separate offsite backup in case of a worst case disaster such as a tornado, fire, or nuclear strike.

>>kijin+JI
> You can't just rsync files into a fully managed RDS PostgreSQL or Elasticsearch instance. You'll probably need to do a dump and restore, especially if the source machine has bad disks and/or has been running a different version. This will take much longer than simply copying the files.

That is true, albeit not in all cases!

An alternative approach (that has some serious caveats) would be to do full backups of the DB directory, e.g. /var/lib/postgresql/data or /var/lib/mysql (as long as you can prevent invalid state data there) and then just starting up a container/instance with this directory mounted. Of course, that probably isn't possible with most if not all managed DB solutions out there.

>>pmoria+yJ
> Ideally, all this data would have been already backed up to AWS (or your provider of choice) by the time your primary service failed, so all your have to do is spin up your backup server and your data would be waiting for you.

Sure, though the solution where you back up the data probably won't be the same one where the new live DB will actually run, so some data transfer/IO will still be needed.

replies(1): >>pmoria+KY

>>O_____+wC
Thanks sctb!!

(sctb is Scott, former HN mod)

https://news.ycombinator.com/item?id=25055115

https://news.ycombinator.com/user?id=sctb

>>hoofhe+Z7
Unless it was due to end of life, or power surge related failure.

Then more than one failing simultaneously isn't so inconceivable.

>>Kronis+9U
> the solution where you back up the data probably won't be the same one where the new live DB will actually run, so some data transfer/IO will still be needed

The S3 buckets where HN is backed up to could themselves be constantly copied to other S3 buckets which could be the buckets directly used by an EC2 instance, were it ever needed in case of emergency.

That would avoid on-demand data transfer from the backup S3 buckets themselves at the time of failure.

The backup S3 buckets could also be periodically copied to Glacier for long-term storage.

That's for an all-AWS backup solution. Of course you could do this with (for example) another datacenter and tapes, if you wanted to... or another cloud provider.

>>dang+s3
I mean to be fair, had this not been a SSD defect, the probability of four of them dying at the same time ( or in extreme close proximity of a few hours ) is indeed very unlikely. And choosing different SSD vendor would have prevented this happening even if one did have counter overflow. Or in this case the 2nd Server could have been a different Model or vendor.

replies(1): >>dang+3f2

>>dang+92
recovery with very little data loss, well done

>>ksec+191
I'm not sure if 4 failed for that reason, or if only 2 failed for that reason and then the attempts to restore them from the RAID array failed for a different reason.

2 failures within a few hours is unlikely enough already though, unless there was a common variable (which there clearly was).

>>pmoria+nK
That reminds me of Jerry Weinberg's dictum: whenever you hear the word "should" on a software project, replace it with "isn't".

>>590075

replies(2): >>kqr+ty5 >>mst+oT5

>>dang+yf2
That goes along with "almost never" which is a synonym for "sometimes" and "maintenance-free" which is a synonym for "throw it out and buy a new one when it breaks".

replies(1): >>duckmy+Ei6

>>dang+yf2
This is brilliant and I suspect generalises to "won't" when reading RFCs.

>>erik_s+Rc
I believe it was IBM, not the government.

>>kqr+ty5
"Almost never" -> "more often than you would want"