Tell HN: HN Moved from M5 to AWS

submitted by 1vuio0+(OP) on 2022-07-09 01:33:16 | 282 points 221 comments

After many years of remaining static, HN's IP address changed.^1

Old: 209.216.230.240

New: 50.112.136.166

Perhaps this is temporary.

Little known fact: HN is also available through Cloudflare. Unlike CF, AWS does not support TLS1.3.^2 This is not working while HN uses the AWS IP.

1. Years ago someone on HN tried to argue with me that IP addresses will never stay the same for very long. I used HN as an example of an address that does not change very often. I have been waiting for years. I collect historical DNS data. When I remind HN readers that most site addresses are more static than dynamic, I am basing that statement on evidence I have collected.

2. Across the board, so to speak. Every CF-hosted site I have encountered supports TLS1.3. Not true for AWS. Many (most?^3) only offer TLS1.2.

3. Perhaps a survey is in order.
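
For anyone who wants to run the kind of survey mentioned in note 3, here is a minimal sketch of a per-host check using only Python's standard `ssl` module. The function names `tls13_only_context` and `supports_tls13` are made up for this illustration, and the check is deliberately crude: any connection failure is reported as "no TLS 1.3", so an unreachable host is indistinguishable from one that only offers TLS 1.2.

```python
import socket
import ssl


def tls13_only_context() -> ssl.SSLContext:
    """Build a client context that refuses anything older than TLS 1.3."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


def supports_tls13(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a handshake succeeds with TLS 1.3 as the minimum.

    Assumption: network errors and handshake failures are both treated
    as False, which is good enough for a rough survey only.
    """
    ctx = tls13_only_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version() == "TLSv1.3"
    except OSError:  # ssl.SSLError is a subclass of OSError
        return False
```

Calling `supports_tls13("news.ycombinator.com")` would then give one data point per host. The result depends on whatever is serving that name at the time, so no expected output is shown here.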


>>ethanw+a1
Nah, the root cause was a double disk failure. Their primary server’s disk failed, and then their failover server failed. https://twitter.com/hnstatus/status/1545409429113229312?s=21...

>>1vuio0+(OP)
For those who are unaware: M5 here means M5 Hosting, https://www.m5hosting.com/. It's not the EC2 M5 instance type.

>>1vuio0+(OP)
> Unlike CF, AWS does not support TLS1.3. This is not working while HN uses the AWS IP.

This seemed implausible so I looked into it, and it's wrong as stated (at best, it needs to be made more precise to capture what you intended). First, you've mentioned Cloudflare, but the equivalent AWS product (CloudFront) does support TLS 1.3 (https://aws.amazon.com/about-aws/whats-new/2020/09/cloudfron...).

HN isn't behind CloudFront, though, so you probably mean their HTTP(s) load balancers (ALB) don't support TLS 1.3. Even that's an incomplete view of the load balancing picture, since the network load balancers (NLB) do support TLS 1.3, https://aws.amazon.com/about-aws/whats-new/2021/10/aws-netwo....

>>fomine+i2
It also isn't a move to an unreleased Apple processor.

There's https://en.wikipedia.org/wiki/Apple_A5 and https://en.wikipedia.org/wiki/Apple_M1 https://en.wikipedia.org/wiki/Apple_M2

>>solard+14
https://news.ycombinator.com/item?id=28479595

>>matheu+j4
https://news.ycombinator.com/item?id=16076041

4 million requests per day in 2018.

>>1vuio0+(OP)
Why the move?

Hopefully it’s simply that M5 didn’t have a server ready and they’ll migrate back.

Vultr has a great assortment of bare metal servers.

https://www.vultr.com/products/bare-metal/#pricing

>>phailh+r9
Unicode has code points for superscript/subscript digits. That one is U+00B9: https://www.compart.com/en/unicode/U+00B9 (So it's "normal text", as far as HN is concerned. Note that HN does filter some things, like emoji.)

I was on macOS when I typed it, there it's Control+Cmd+Space, and then search for "super" which gets close enough.

On my Linux machine, I can either do Compose, ^, 1, or Super+e and then search for it. (But both of these require configuration; either setting a key to be Compose (I sacrifice RAlt), or setting up whatever it is the IME I have is for Super+e.)

>>boolea+m9
6 million per day as of 10 months ago: https://news.ycombinator.com/item?id=28479595
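
To put the daily totals in this subthread into per-second terms, a quick calculation is enough. This is only an average over 24 hours, assuming the 4 million and 6 million figures quoted here. Real traffic is bursty, so peaks would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: int) -> float:
    """Average requests per second for a given daily total."""
    return requests_per_day / SECONDS_PER_DAY


for label, per_day in (("2018 figure", 4_000_000), ("later figure", 6_000_000)):
    print(f"{label}: {average_rps(per_day):.1f} requests/s on average")
# 2018 figure: 46.3 requests/s on average
# later figure: 69.4 requests/s on average
```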

>>omegal+Xc
It was an SSD that failed in each case, and in a similar way (e.g. both were in RAID arrays but neither could be rebuilt from the array - but I am over my skis in reporting this, as I barely know what that means).

The disks were in two physically separate servers that were not connected to each other. I believe, however, that they were of similar make and model. So the leading hypothesis seems to be that perhaps the SSDs were from the same manufacturing batch and shared some defect. In other words, our servers were inbred! Which makes me want to link to the song 'Second Cousin' by Flamin' Groovies.

The HN hindsight consensus, to judge by the replies to https://news.ycombinator.com/item?id=32026606, is that this happens all the time, is not surprising at all, and is actually quite to be expected. Live and learn!

>>1vuio0+(OP)
> someone on HN tried to argue with me that IP addresses will never stay the same for very long

Deeply deeply agree with you. Not whoever was arguing. :) My "dialup" VM in the cloud I use for assorted things has had the same IP for at least 7 years, probably longer. (Thanks Linode!) After a few years, it's honestly not that hard to remember an arbitrary 32bit int. :)

  $ w3m -dump http://846235814   # ;)
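
For anyone puzzled by the URL above: an IPv4 address is just a 32-bit integer, and many clients accept the decimal form directly. A small sketch with Python's standard `ipaddress` module shows the conversion in both directions. The helper names are only for illustration.

```python
import ipaddress


def ipv4_to_int(dotted: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(dotted))


def int_to_ipv4(value: int) -> str:
    """Convert a 32-bit integer back to dotted-quad notation."""
    return str(ipaddress.IPv4Address(value))


print(ipv4_to_int("50.112.136.166"))  # 846235814
print(int_to_ipv4(846235814))         # 50.112.136.166
```

So the integer in the w3m command is the new HN address from the original post.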

>>rstupe+Bg
Scott Bell is a former[1] HN moderator: https://news.ycombinator.com/user?id=sctb

[1]: https://news.ycombinator.com/item?id=25055115

>>dang+xd
I believe a more plausible scenario could be that each drive failed during the RAID rebuild and restriping process.

This is a known issue in NAS systems, and FreeNAS always recommended running two RAID arrays with 3 disks in each array for mission critical equipment. By doing so, you can lose a disk in each array and keep on trucking without any glitches. Then if you happen to kill another disk during restriping, it would fail over to the second mirrored array.

You could hotswap any failed disks in this setup without any downtime. Losing 3 drives together in a server would be highly unlikely.

https://www.45drives.com/community/articles/RAID-and-RAIDZ/

>>wolfga+Oh
https://news.ycombinator.com/item?id=32028511

>>rubyis+h8
maybe not...

https://aws.amazon.com/solutions/case-studies/reddit-aurora-...

>>metada+de
We don't throttle specific users*. At one time we did (it was called slowbanning) but that's been gone for close to a decade now.

It's a bit disconcerting how often I say "close to a decade now" now.

* Edit: we do rate-limit some accounts (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...), which throttles how much they can post to the site, but we don't throttle how quickly they can view the site.

>>dang+vp
I'll save someone else the search.

https://en.m.wikipedia.org/wiki/Illegitimi_non_carborundum

>>russel+Kv
https://news.ycombinator.com/item?id=4756677

>>herpde+mv
I guess the same way we switched to this one?

I wrote more about data loss at https://news.ycombinator.com/item?id=32030407 in case that's of interest.

>>f0e4c2+36
Of those "correctly" architected apps, most are not properly tested for the failovers and won't actually work as architected (because of your own bugs or because aws failover stuff has bugs and you can't even test it).

Eg, falls over due to steep traffic spikes caused by outages when autoscaling mechanisms get previously unseen levels of load increases and enter some yoyo oscillation pattern, whole AZ is overloaded because all the failovers from the other failing AZ triggering at once, hit circuit breakers, spin up too slowly to ever pass health checks etc. Or can't detect something becoming glacially slow but not outright failing.

See eg https://www.theverge.com/2021/12/22/22849780/amazon-aws-is-d... & https://www.theverge.com/2020/11/25/21719396/amazon-web-serv... etc (many more examples are out there)

>>Shroud+7H
Why would I object? If it works, I will use it.

For example, I ping 198.41.0.4. I choose to ping that address over all the others, e.g., www.google.com or whatever other people use. That is what I mean by user choice. I know the address is anycasted. Where the packets actually go is not something I get to choose. It would be neat to be able to control that, e.g., if source routing actually worked on today's internet. But I have no such expectations.

How do Tor users know that an exit node IP address listed for a foreign country is not anycasted and the server is actually located somewhere else?

Maybe check against a list of anycast prefixes.

http://raw.githubusercontent.com/bgptools/anycast-prefixes/m...
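
A rough sketch of what such a check could look like, using Python's standard `ipaddress` module. The two prefixes in `SAMPLE_PREFIXES` are placeholders chosen for illustration and are not copied from the linked list. In practice you would load the real file, one prefix per line.

```python
import ipaddress
from typing import Iterable, Optional


def matching_prefix(address: str, prefixes: Iterable[str]) -> Optional[str]:
    """Return the first prefix that contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for prefix in prefixes:
        network = ipaddress.ip_network(prefix, strict=False)
        # Only compare like with like (IPv4 vs IPv4, IPv6 vs IPv6).
        if ip.version == network.version and ip in network:
            return prefix
    return None


# Hypothetical sample data, NOT taken from the linked prefix list.
SAMPLE_PREFIXES = ["198.41.0.0/24", "2001:db8::/32"]

print(matching_prefix("198.41.0.4", SAMPLE_PREFIXES))       # 198.41.0.0/24
print(matching_prefix("209.216.230.240", SAMPLE_PREFIXES))  # None
```

A match only says the address falls inside a prefix someone has flagged as anycast. It says nothing about where a given packet actually lands.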

>>dang+ec
Thanks mthurman!!

_____

Appears “mthurman” is Mark Thurman, a software engineer at Y Combinator since 2016; HN profile has no obvious clues.

https://www.linkedin.com/in/markethurman

https://news.ycombinator.com/user?id=mthurman

>>brudge+xA
You don't have to speculate - the Arc forum code is available at http://arclanguage.org.

>>Kronis+sG
> Assuming a link of 1 Gbps, ideally you'd be able to transfer close to 125 MB/s. So that'd mean that in 5 minutes you could transfer around 37'500 MB of data to another place, though you have to account for overhead. With compression in place, you might just be able to make this figure a lot better, though that depends on how you do things.

Ideally, all this data would have been already backed up to AWS (or your provider of choice) by the time your primary service failed, so all you have to do is spin up your backup server and your data would be waiting for you.

(Looks like HN does just this: https://news.ycombinator.com/item?id=32032316 )
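
The arithmetic in the quoted estimate checks out and can be reproduced in a few lines. This is the theoretical ceiling only, assuming decimal units (1 Gbps = 1000 Mbps, 8 bits per byte) and ignoring protocol overhead and compression. The function name is just for illustration.

```python
def max_transfer_mb(link_gbps: float, seconds: float) -> float:
    """Upper bound on megabytes moved over a link in the given time."""
    megabytes_per_second = link_gbps * 1000 / 8  # 1 Gbps -> 125 MB/s
    return megabytes_per_second * seconds


print(max_transfer_mb(1, 5 * 60))  # 37500.0
```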

>>sillys+G8
> Being down all morning is an impressive recovery time, because they had to provision an EC2 server and transfer all data to it

It looks like the data was already in S3: [1]

The recovery probably would have been a lot faster had they had a fully provisioned EC2 image standing by... which I'd bet they will from now on.

[1] - https://news.ycombinator.com/item?id=32032316

>>O_____+BM
Thanks sctb!!

(sctb is Scott, former HN mod)

https://news.ycombinator.com/item?id=25055115

https://news.ycombinator.com/user?id=sctb

>>1vuio0+(OP)
I wonder if it is time for a server upgrade.

HN was running on Xeon(R) CPU E5-2637 v4 [1], that is SandyBridge era. A 2-core CPU serving 6M requests a day.

If an iPhone had more memory, the whole of HN could be served from it.

[1] https://www.intel.com/content/www/us/en/products/sku/64598/i...

>>fomine+i2
thanks for this!

Interesting feature comparisons: https://www.m5hosting.com/iaas-cloud/

So it's a private cloud, not M5-managed cloud environments across multiple public cloud providers.

>>russel+Zx
Agree, HN’s search returned nothing, possibly because the terms are not English, but no idea.

Try Google too next time if HN’s search doesn’t find what you’re looking for:

https://www.google.com/search?q=site:news.ycombinator.com+Il...

>>19h+Oe1
No - it's enabled by default for all available security policies. CloudFront lets you configure the minimum TLS version - the maximum is always TLS1.3.

https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...

However HN is not using CloudFront - so this doesn't matter for evaluating why HN is not supporting TLS1.3

>>dang+vh
Gotta get something better than those pizza cutters

https://www.youtube.com/watch?v=3d-OVlIYpuQ

>>pojzon+5z
Per the spreadsheet here https://awsmaniac.com/aws-outages/ :

There seem to have been multiple "full" outages in 2011-12 in AWS' us-east-1 region, which, granted, is the oldest AWS region and likely has a bunch of legacy stuff. By "full" outages I mean that only a few core services fell over, but the entire region became inaccessible due to those core failures.

>>9wzYQb+IR
https://news.ycombinator.com/item?id=1352355

>>pmoria+sU
That reminds me of Jerry Weinberg's dictum: whenever you hear the word "should" on a software project, replace it with "isn't".
