I definitely spent an unreasonable amount of time thinking my internet had a problem when trying to open HN, since it's always just been so constant.
Double disk failure is improbable but not impossible.
The most impressive thing is that there seems to be almost no data loss whatsoever. Whatever the backup system is, it seems rock solid.
https://twitter.com/HNStatus/status/1545461511870566400
Disk and fallback server failure. Was definitely a long day for their ops team, on a Friday no less.
Needless to say I opened a new tab, typed "n", and hit enter countless times today before my brain caught up with my muscle memory.
I realized how little of this I find elsewhere in my life - whether through Reddit or even my IRL friend circles.
This realization saddens me - I feel like I shouldn’t have to rely on HN so much to scratch this particular itch.
Perhaps I need to get out more.
I know old posts indicate it's running on low-core-count but high-frequency Intel CPUs on FreeBSD, with no database (just flat files).
I wonder if it’s still the same.
Anyone been on Slashdot lately? Checked it out too; it was really nice.
Thank you to everyone who keeps this thing running.
Also I remember the "Why we're going with Rails" story on the front page from before it went down.
> Perhaps I need to get out more.
Another way to look at it is that you have a particular set of interests and HN is the online outlet that serves those interests. There's nothing wrong with that at all, and you don't need multiple sources for it. No different from someone who likes to ride bikes owning one bike, or someone who likes to read going to the same local library every week for 10 years.
Couldn't possibly have been HN that was the problem haha
Over a year with no issues. Impressive.
It's not even improbable if the disks are the same kind purchased at the same time.
EDIT: My response was based on some edits that are now removed.
____________
Related:
https://www.newyorker.com/news/letter-from-silicon-valley/th...
Good news for people who were banned, or for posts that didn't get enough momentum :)
edit: Was restored from backup, so there was definitely some data loss.
Not that I deserve or expect one from a free service, but I enjoy reading postmortems of failures where both the primary and backup systems failed; I like to see what holes I might have in my own failover setup.
https://news.ycombinator.com/item?id=32026565
The ones after it are hours later and usually deleted, until this post (...71).
I reset the router... and HN was still down.
<sniff>
... I might have been more productive than usual today.
Whereas, if HN closes, there is no equivalent replacement available.
If the server went down at XX:XX, and the backup they restored from is also from XX:XX, there isn't data loss. If the server was down for 8 hours, the last data being 8 hours old isn't data loss; it's correct.
Does that mean nothing of value was lost?
The latter is understandable; the former would be quite a surprise for such a popular site. It would mean that the machines have no disk redundancy and that the server goes down immediately on a disk failure, leaving the fallback server as the only backup.
It's actually surprisingly common for failover hardware to fail shortly after the primary hardware. It's normally been exposed to similar conditions to what killed the primary and the strain of failing over pushes it over the edge.
This logs lesser ones: https://hn.hund.io/
But I'm too lazy to write the application. I wish there were some SDK I could spin up, like phpBB back in the day, to have something exactly like HN.
% host news.ycombinator.com
news.ycombinator.com has address 50.112.136.166
and also interesting: DNS TTL is set to 1 (one).
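If anyone wants to verify, dig prints the TTL in the second column of the answer section (a rough sketch; a caching resolver may show a lower remaining value):

    % dig +noall +answer news.ycombinator.com
    news.ycombinator.com.    1    IN    A    50.112.136.166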
Still, I see no reason for prioritizing that failure mode on a site like HN.
I guess proper redundancy also means having different brands of equipment, in some cases.
In the early 00s, when Google went offline I wouldn't believe it and would go check my connection (even if I was fetching other sites at the same time). It looks like nowadays HN is in that place.
Having a RAID5 array crash and burn because a second disk failed during the rebuild that followed the first disk's failure is a common story.
GitHub page for that project: https://github.com/clintonwoo/hackernews-remix-react
8 hours of downtime in a given year is 99.9% availability, so only three nines. The major SaaS platforms are all basically at least as resilient as this, and most have more stringent SLAs.
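For reference, here's the back-of-the-envelope arithmetic behind that figure (a quick check with bc, assuming 8,760 hours in a non-leap year):

    % echo "scale=6; 100*(8760-8)/8760" | bc
    99.908675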
https://web.archive.org/web/20220330032426/https://ops.faith...
NetRange: 50.112.0.0 - 50.112.255.255
CIDR: 50.112.0.0/16
NetName: AMAZON-EC2-USWESTOR
NetHandle: NET-50-112-0-0-1
Parent: NET50 (NET-50-0-0-0-0)
NetType: Direct Allocation
OriginAS: AS14618
Organization: Amazon.com, Inc. (AMAZO-47)

(Thankfully, they didn't completely die but just put themselves into read-only)
It feels like we've lost a lot of that observability and immediacy with the cloud. It's not as easy to quickly understand the larger picture. You can understand the state of various services with the web console or command line tools but tracing a path through those services is much less obvious and efficient.
I'm kind of nervous to even discuss this, as I wonder if it's just my age showing, especially since I see very few people mention this as one of the downsides of various cloud solutions. Maybe I'm just jaded?
However, it takes money and time to keep it around in a not-for-profit way, so it will be an institution only as long as its funding stays the same.
Thanks Dang and company.
I appreciate you all.
@my HN peers:
Have a great weekend and thank you all for being you. I learn a ton here and enjoy the perspectives often found on these pages. It is all high value.
HN is running on an old laptop from Viaweb.
Arc is running under the pg user and it's used as the process supervisor.
The actual web server is a VB app running on Linux through Wine.
The flat files have been migrated to an MS Access DB, also running through Wine.
Early on /. was amazing! Remember CmdrTaco working it all out, often taking us along for the ride?
Good times, frequently good discussion.
HN has been better for years now, and was better even at its inception, for the most part.
/. has improved a bit. Good to see, or I caught it on a good day.
It's not always easy, but if you can, you want manufacturer diversity, batch diversity, maybe firmware version diversity[1], and power on time diversity. That adds a lot of variables if you need to track down issues though.
[1] you don't want to have versions with known issues that affect you, but it's helpful to have different versions to diagnose unknown issues.
I was worried that I may actually have to go out and do things instead of lurking here this weekend..
Primary failure: https://news.ycombinator.com/item?id=32024036 Standby failure: https://twitter.com/HNStatus/status/1545409429113229312
WELL TODAY WAS VERY INCONVENIENT LET ME TELL YOU! :)
Each server has a pair of mirrored disks, so it seems we're talking about 4 drives failing, not just 2.
On the other hand the primary seems to have gone down 6 hours before the backup server did, so the failures weren't quite simultaneous.
But even comparing apples to oranges, the HN status page someone else pointed out (https://hn.hund.io/) seems to show that HN has had more than one outage in just the past month. All but today's and last night's were quite short, but still. Sometimes you need some extra complexity if you want to get to zero downtime overall.
That's not something the HN website needs but I think AWS is doing fine even if that's your point of comparison.
That said, HN does have quality content and the signal/noise is way better than sites designed specifically to keep you addicted.
I do this too, and it's because this site is an addictive slot machine just like every other social networking site. I actually really hate this website, but I'm here almost every day, because I can't seem to break the habit. Neat. It's probably because I have a common impulse control / executive functioning disorder, and the way the front page works exploits some bug in my brain.
Reddit does this to me too. I also hate Reddit.
I like wasting time on HN because it's time not actually wasted :)
And don't get me started on Twitter... Sure there are some gems on twitter but I have to wade through 1000s of tweets of pure nonsense to see them. No thanks. If it's something really great someone will post a link on HN anyway :)
Not doing it for this reason but rather for financial ones :) But as I have a totally mixed bunch of disk sizes, I have no RAID, and a disk loss would be horrible.
https://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashi...
For load balancing I would consider this very likely because both are equally loaded. But "failover" I would usually consider a scenario where a second server is purely in wait for the primary to fail, in which case it would be virtually unused. Like an active/passive scenario as someone mentioned below.
But perhaps I got my terminology mixed up. I'm not working with servers so much anymore.
Is it more appropriate to call the strategy in this case fallback, or failover? Since the secondary server wasn't running in production until the first one failed, it sounds like fallback?
Perhaps a higher-reliability strategy would have been, instead of having a secondary server, to just have more mirrored disks on the main server, to reduce the likelihood of the array being compromised?
Alternatively, to run both the primary and secondary servers in production all the time. But that would presumably merely move the single point of failure to the proxy?
[0] https://aws.amazon.com/builders-library/avoiding-fallback-in...
When I saw hn was down, I double-checked the news to see if a major part of the internet had gone down.
It seems the perfect circumstances to really last. It doesn't have an invasive business model, or investors screaming for ROI either. That's the kind of thing that often leads to user-hostile changes that so often start the decline into oblivion.
Also, I would imagine it's pretty cheap to host; after all, it's all very simple text. I don't think it hosts any pictures besides the little Y Combinator logo in the corner :)
It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.
It would be even better if they just keep doing it as they are though <3
I'd argue that this site has a good signal/noise ratio by design and specifically to keep you addicted (where "addicted" means using and constantly returning to the site). This site is just designed to attract people who are put off by the kinds of tricks employed elsewhere.
It's great that they were able to spin it up in the cloud for recovery purposes. But it's more legendary on a real server <3
Yes I'm old :P
Why hate this site? Because it contains interesting/useful content often enough to make you come back? That'd be a weird reason to hate the site. I too have a common impulse control/executive functioning disorder, but I don't hate the things that it makes me vulnerable to. If I were feeling resentful, I'd have to put the blame on my condition.
I don't have to ask why you hate Reddit; the valid reasons for hating Reddit are myriad.
It's a pretty universal issue. Companies are just getting better at using it to their advantage.
True appreciation to the team who works to keep it up, high-quality, and impactful.
You know how they say to always test your backups? Always test your failover too.
Or so we tell ourselves.
Wasn't Cloudflare down for a few hours recently? Cloud providers don't magically fix outages...
Would you please stop spamming your opinion? You wrote it once; that's enough.
Thank you very much <3
Edit - HN is on AWS now. https://news.ycombinator.com/item?id=32026571
https://check-host.net/ip-info?host=https://news.ycombinator...
https://search.arin.net/rdap/?query=50.112.136.166
Note: HN has been on M5 Hosting for years, and they were still there as of 16 hours ago per Dang:
https://news.ycombinator.com/item?id=32024105
During the outage, I listed places to check HN-related systems and posted them here:
I'm sorry, this kind of thing reeks of point "whoring" to me, and I consider that to be an indefensible thing to do; it's pollution. We can see the site. We know it's up. We don't need to be told. Stop doing things purely to increase your score. This isn't a game. Etc.
My solution: a 3-hour focus mode browser extension.
1. Install the BlockSite chrome extension [1].
2. In BlockSite settings, add HN, Twitter, and any other distracting sites to the Focus Mode list, and set the Focus Mode time to 3 hours.
3. Ensure you uninstall all social media apps from your phone.
4. When I find myself opening a new tab and typing "n" to get a dopamine hit, I then turn on my 3-hour focus mode.
[1] https://chrome.google.com/webstore/detail/blocksite-block-we...
In all seriousness, at least 2/3rds of the complexity is because of your choice of tools and approach. Terraform alone makes things significantly more complex. If you just want to trigger a deployment, then a Template Spec made from a Bicep file could be banged out in like... an hour.[1]
When in Rome, do as the Romans do. You basically took a Microsoft product and tried to automate it with a bunch of Linux-native tools. Why would you think this would be smooth and efficient?
Have you ever tried automating Linux with VB Script? This is almost the same thing.
[1] Someone had a similar example here using Logic Apps and a Template Spec: https://cloudjourney.medium.com/azure-template-spec-and-logi...
Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565
First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)
So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.
There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.
I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.
And talk is cheap. I dare you to write a blog post or make a public GitHub repo doing the equivalent work (see the Goals section) with your own tools. If you can, I'll be super impressed (not that my admiration is worth anything).
One thing you'll run into is that AD roles and other authn aren't accessible via ARM templates/Bicep.
Definitely thought the same. Then I realized that I'm browsing through the work VPN and had a second thought: what if our admins decided to fight procrastination?
With that said, the comments are the most addictive part of this site.
Were they connected on the same power supply? I had 4 different disks fail at the same time before, but they were all in the same PC... (lightning)
Is your backup system tied to your API? Algolia is a third party service, and streaming the latest HN data to Algolia seems pretty similar to streaming it to a backup system.
And that they were sold by HP or Dell, and manufactured by SanDisk.
Do I win a prize?
(None of us win prizes on this one).
Yes—I'm a bit unclear on what happened there, but that does seem to be the case.
[1] https://twitter.com/HNStatus
[2] https://www.reuters.com/business/media-telecom/rogers-commun...
Plus, it's hard to quantify many cases because there is hard-down and soft-down (partial interruptions).
Unbelievable. Thank you for sharing your experience!
Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.
But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.
Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.
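For scale, 40,000 power-on hours is a bit over four and a half years of continuous uptime, which lines up with the roughly 4.5-year service life mentioned elsewhere in the thread (a quick check with bc):

    % echo "scale=2; 40000/24/365.25" | bc
    4.56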
This thread is making me feel a lot less crazy.
I think I’m just realizing I need to find that in more places, regardless of topic focus.
[1] https://www.reddit.com/r/sysadmin/comments/f5k95v/dell_emc_u...
Tell HN: HN Moved from M5 to AWS - https://news.ycombinator.com/item?id=32030400 - July 2022 (116 comments)
Ask HN: What'd you do while HN was down? - https://news.ycombinator.com/item?id=32026639 - July 2022 (218 comments)
HN is up again - https://news.ycombinator.com/item?id=32026571 - July 2022 (314 comments)
I'd particularly look here: https://news.ycombinator.com/item?id=32026606 and here: https://news.ycombinator.com/item?id=32031025.
If you scroll through my comments from today via https://news.ycombinator.com/comments?id=dang&next=32039936, there are additional details. (Sorry for recommending my own comments.)
If you (or anyone) skim through that stuff and have a question that isn't answered there, I'd be happy to take a crack at it.
e.g. Simultaneous Engine Maintenance Increases Operating Risks, Aviation Mechanics Bulletin, September–October 1999 https://flightsafety.org/amb/amb_sept_oct99.pdf
Also, you shouldn't wait for disks to fail to replace them. HN's disks were used for 4.5 years, which is greater than the typical disk lifetime, in my experience. They should have replaced them sooner, one by one, in anticipation of failure. This would also allow them to stagger their disk purchases to avoid similar manufacturing dates.
Hopefully archive.org is involved in archiving HN, though unfortunately archive.org's future itself is in jeopardy.
A long time ago we had a Dell server that came with a pre-configured RAID from Dell (don't ask, I didn't order it). Eventually one disk in this server died; what sucked was that the second disk in the RAID array also failed only a few minutes later. We had to restore from backup, which sucked, but to our surprise, when we opened the Dell server, the two disks had sequential serial numbers. They came from the same batch at the same time. Not a good thing to do when you sell people pre-configured RAID systems at a markup...
How so?? This is the first I've heard of it.
I've seen too many dead disks with perfect SMART. When the numbers go down (or up) and triggers fire, then you surely need to replace the disk[0], but SMART without warnings just means nothing.
[0] My desktop ran for years entirely on disks removed from client PCs after a failure. Some of them had pretty bad SMART; on a couple I needed to move the starting point of the partition a couple of GBs further from sector 0 (otherwise they would stall pretty soon), but overall they worked fine. Still, I never used them as reliable storage, and I knew I could lose them at any time.
Of course I don't use repurposed drives in the servers.
PS: when I tried to post this, I received "We're having some trouble serving your request. Sorry!" Sheesh.
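For anyone who wants to look at those numbers on their own drives, smartctl from smartmontools dumps them (a minimal sketch; the device paths are just examples):

    % sudo smartctl -a /dev/sda       # full report: health status, vendor attributes, error log
    % sudo smartctl -A /dev/nvme0     # attributes/health table only; NVMe drives report a different set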
Here are some relevant links:
https://news.ycombinator.com/item?id=31703394
https://decrypt.co/31906/activists-rally-save-internet-archi...
https://www.courtlistener.com/docket/17211300/hachette-book-...
I normally bill for cloud automation advice, but the gist is:
You can automate RBAC/IAM via Bicep or ARM[1], but only for existing groups or system managed identities or user managed identities. This usually covers everything that is typically done for cloud automation.
Note that the initial setup might require "manual" steps to set up the groups and their memberships, but then the rest can be automated. In other words, there's a one-time "prerequisites" step followed by 'n' fully automated deployments.
You can also use templates to deploy groups dynamically[2] if you really need to, but this ought to be rare. The problem with this is that templates are designed to deploy resources, and AAD groups aren't resources.
More generally, your mistake IMHO was to try to automate the automation itself, while side-stepping the Azure-native automation tooling by choosing Terraform+Functions instead of Template Specs with delegated permissions via Azure RBAC. Most of your template is used to deploy the infrastructure to deploy a relatively simple template!
This reminds me of people writing VB Scripts to generate CMD files that generate VB Scripts that trigger more scripts in turn. I wish I was kidding, but a huge enterprise did this seven levels deep for a critical systems-management process. It broke and caused massive problems. Don't do this; just KISS and remember https://xkcd.com/1205/
[1] via Microsoft.Authorization/roleAssignments
[2] via Microsoft.Resources/deploymentScripts
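To make the "one-time prerequisites, then fully automated deployments" split concrete, here's a minimal sketch using the Azure CLI. The group name, resource group, template file, and parameter name are made up for illustration, and on older CLI versions the group's id field is exposed as objectId rather than id:

    # One-time prerequisite (outside the template): create the AAD group.
    az ad group create --display-name "app-operators" --mail-nickname "app-operators"

    # Repeatable deployment: pass the existing group's object id into the Bicep template,
    # which assigns roles to it via Microsoft.Authorization/roleAssignments.
    GROUP_ID=$(az ad group show --group "app-operators" --query id -o tsv)
    az deployment group create \
      --resource-group my-rg \
      --template-file main.bicep \
      --parameters operatorsGroupObjectId="$GROUP_ID"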
I guess it got them some goodwill during Corona but it could cause more damage than it's worth.
I wouldn't have done it; it's not as if it provided real value during the pandemic. Those who are really into books and don't care about copyright already know their way to more gray-area sites like LibGen.
HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)
HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)
Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)
HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)
HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)
(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)
> I normally bill for cloud automation advice, but the gist is
Can you please omit supercilious swipes from your comments here? Everybody knows different things. If you know more than someone else about $thing, that's great—but please don't put them down for it. That's not in the spirit of kindness and curious conversation that we're hoping for here.
https://news.ycombinator.com/newsguidelines.html
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
This one is just ... maddening.
Then the people under them who do give a shit, because they depend on those servers, aren’t allowed to register with HP etc for updates, or to apply firmware updates, because “separation of duties”.
Basically, IT is cancer from the head down.
The lesson I learned is that the three replacements went to different arrays and we never again let drives from the same batch be part of the same array.
It makes you lose data and forces you to purchase new hardware; where I come from, that's usually referred to as "planned" or "convenient" obsolescence.
Of course there's no law that says SSD firmware writers can't be rookies.
Both planned and convenient obsolescence are beneficial to device manufacturers. Without proper accountability for that, it only becomes a normal practice.
The manufacturer, obviously. Who else would it be?
Could be an innocent mistake or a deliberate decision. Further action should be predicated on the root cause. Which includes intent.
Perfectly acceptable.