Hmm, that actually makes me wonder how big it would actually be. The nature of HN (not really storing a lot of images/videos like Reddit, for example) would probably lend itself well to being pretty economical in terms of the space used.
Assuming a 1 Gbps link, ideally you'd be able to transfer close to 125 MB/s. That means that in 5 minutes you could move around 37,500 MB of data to another place, though you have to account for protocol and disk overhead. With compression in place you might be able to make this figure quite a bit better, though that depends on how compressible the data is and whether compression can keep up with the link.
In practice the link speed will vary (a lot) based on what hardware/hosting you're using, where and how you store any backups, and what you use to transfer them elsewhere. If you can do that incrementally, it's even better (scheduled full backups, incremental updates afterwards).
Regardless, in an ideal world where you know enough about your setup, this boils down to a simple equation, letting you plot how long bringing over all of the data would take for any given DB size (on your current infrastructure/setup). For many systems out there, 5 minutes would indeed be possible - but that becomes less likely the more data you store, or the more complicated components you introduce (e.g. separate storage for binary data, multiple services, message queues with persistence etc.).
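To make that concrete, here's a rough back-of-the-envelope sketch in Python - the efficiency and compression figures are just placeholder assumptions you'd swap for your own measurements:

    # Rough estimate of how long a full copy takes over a given link.
    # efficiency and compression_ratio are assumptions, not measured values.
    def transfer_time_seconds(db_size_gb: float,
                              link_gbps: float = 1.0,
                              efficiency: float = 0.8,        # protocol/disk overhead (assumed)
                              compression_ratio: float = 1.0  # 1.0 = no compression
                              ) -> float:
        effective_mb_per_s = link_gbps * 125 * efficiency       # 1 Gbps ~ 125 MB/s raw
        data_to_move_mb = (db_size_gb * 1000) / compression_ratio
        return data_to_move_mb / effective_mb_per_s

    # e.g. 30 GB over 1 Gbps at 80% efficiency, no compression:
    # transfer_time_seconds(30) == 300.0, i.e. right at the 5 minute mark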
That said, in regards to the whole container argument: I think there are definitely benefits to be had from containerization, as long as you pick a suitable orchestrator (Kubernetes if you know it well from a lab setting or from running it in prod under someone else's supervision, or something simpler like Nomad/Swarm that you can prototype with quickly).
You can't just rsync files into a fully managed RDS PostgreSQL or Elasticsearch instance. You'll probably need to do a dump and restore, especially if the source machine has bad disks and/or has been running a different version. This will take much longer than simply copying the files.
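For PostgreSQL specifically, the dump and restore might look roughly like this (a minimal sketch - hostnames, users and DB names are placeholders, and in practice you'd also have to deal with roles, extensions, and auth via ~/.pgpass or PGPASSWORD; Elasticsearch has its own snapshot/restore mechanism instead):

    # Minimal dump-and-restore sketch for PostgreSQL -> a managed instance.
    # All hostnames, users and database names below are placeholders.
    import subprocess

    SOURCE = {"host": "old-db.internal", "user": "app", "db": "hn"}       # hypothetical
    TARGET = {"host": "mydb.abc123.us-east-1.rds.amazonaws.com",          # hypothetical
              "user": "app", "db": "hn"}

    # 1) Dump in PostgreSQL's custom format (compressed, restorable with pg_restore).
    subprocess.run(["pg_dump", "-Fc",
                    "-h", SOURCE["host"], "-U", SOURCE["user"],
                    "-f", "/tmp/hn.dump", SOURCE["db"]], check=True)

    # 2) Restore into the managed instance; --no-owner avoids role mismatches.
    subprocess.run(["pg_restore", "--no-owner",
                    "-h", TARGET["host"], "-U", TARGET["user"],
                    "-d", TARGET["db"], "/tmp/hn.dump"], check=True)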
Of course you could install the database of your choice in an EC2 box and rsync all you want, but that kinda defeats the purpose of using AWS and containerizing in the first place.
Ideally, all this data would have been already backed up to AWS (or your provider of choice) by the time your primary service failed, so all you have to do is spin up your backup server and your data will be waiting for you.
(Looks like HN does just this: https://news.ycombinator.com/item?id=32032316 )
That is true, albeit not in all cases!
An alternative approach (that has some serious caveats) would be to take full backups of the DB data directory, e.g. /var/lib/postgresql/data or /var/lib/mysql (as long as you can guarantee the files there are in a consistent state, e.g. the DB was stopped or you used a filesystem snapshot), and then just start up a container/instance with that directory mounted. Of course, that probably isn't possible with most if not all managed DB solutions out there.
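As a sketch of what that could look like with Docker (using the docker Python SDK; the path and version tag are made up, and the image version has to match whatever wrote the data directory):

    # Sketch: start a throwaway Postgres container on top of a copied data directory.
    # The host path and version tag below are assumptions for illustration.
    import docker

    client = docker.from_env()
    container = client.containers.run(
        "postgres:15",
        detach=True,
        # Copied /var/lib/postgresql/data from the backup, mounted back into place;
        # since the data directory already exists, the image skips initdb entirely.
        volumes={"/backups/pgdata-copy": {"bind": "/var/lib/postgresql/data", "mode": "rw"}},
        ports={"5432/tcp": 5432},
    )

File ownership on the copied directory can be a gotcha with this approach, since the container runs Postgres as its own user.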
Sure, though the system where you back up the data probably won't be the same one where the new live DB will actually run, so some data transfer/IO will still be needed.
The S3 buckets HN is backed up to could themselves be continuously copied to other S3 buckets, which could be the buckets an EC2 instance uses directly, were that ever needed in an emergency.
That would avoid on-demand data transfer from the backup S3 buckets themselves at the time of failure.
The backup S3 buckets could also have their contents moved to Glacier periodically (e.g. via lifecycle rules) for long-term storage.
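A rough sketch of what that could look like with boto3 (bucket names and the 30-day figure are made up; in practice you'd probably configure S3 replication rules and lifecycle policies once instead of copying objects by hand):

    # Sketch: mirror a backup bucket into a standby bucket and archive to Glacier.
    # Bucket names and the 30-day figure are made up for illustration.
    import boto3

    s3 = boto3.client("s3")
    SRC, DST = "hn-backups", "hn-backups-standby"   # hypothetical bucket names

    # Server-side copy of every object (no data flows through this machine;
    # objects over 5 GB would need a multipart copy instead).
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=SRC):
        for obj in page.get("Contents", []):
            s3.copy_object(Bucket=DST, Key=obj["Key"],
                           CopySource={"Bucket": SRC, "Key": obj["Key"]})

    # Lifecycle rule on the backup bucket: move objects to Glacier after 30 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket=SRC,
        LifecycleConfiguration={"Rules": [{
            "ID": "archive-to-glacier",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]},
    )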
That's for an all-AWS backup solution. Of course you could do this with (for example) another datacenter and tapes, if you wanted to... or another cloud provider.