Double disk failure is improbable but not impossible.
The most impressive thing is that there seems to be almost no data loss whatsoever. Whatever the backup system is, it seems rock solid.
Also I remember the "Why we're going with Rails" story on the front page from before it went down.
It's not even improbable if the disks are the same kind purchased at the same time.
EDIT: My response was based on some edits that are now removed.
Good news for people who were banned, or for posts that didn't get enough momentum :)
edit: Was restored from backup, so there was definitely some data loss.
If the server went down at XX:XX, and the backup they restored from is also from XX:XX, there isn't data loss. If the server was down for 8 hours, the last data being 8 hours old isn't data loss; it's correct.
The latter is understandable; the former would be quite a surprise for such a popular site. That would mean the machines have no disk redundancy and the server goes down immediately on disk failure. The fallback server would be the only backup.
It's actually surprisingly common for failover hardware to fail shortly after the primary hardware. It's normally been exposed to similar conditions to what killed the primary and the strain of failing over pushes it over the edge.
Those responsible for the sacking have also been sacked.
Still, I see no reason for prioritizing that failure mode on a site like HN.
I guess proper redundancy also means having different brands of equipment in some cases.
Having a RAID5 crash and burn because the backup disk failed during the reconstruction phase after a primary disk failed is a common story.
https://web.archive.org/web/20220330032426/https://ops.faith...
(Thankfully, they didn't completely die but just put themselves into read-only)
However, it takes money and time to keep it around in a not-for-profit way, so it will be an institution only as long as its funding stays the same.
It's not always easy, but if you can, you want manufacturer diversity, batch diversity, maybe firmware version diversity[1], and power on time diversity. That adds a lot of variables if you need to track down issues though.
[1] you don't want to have versions with known issues that affect you, but it's helpful to have different versions to diagnose unknown issues.
Primary failure: https://news.ycombinator.com/item?id=32024036
Standby failure: https://twitter.com/HNStatus/status/1545409429113229312
each server has a pair of mirrored disks, so it seems we're talking about 4 drives failing, not just 2.
On the other hand the primary seems to have gone down 6 hours before the backup server did, so the failures weren't quite simultaneous.
Not doing it for this reason but rather for financial ones :) But as I have a totally mixed bunch of sizes, I have no RAID, and a disk loss would be horrible.
https://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashi...
For load balancing I would consider this very likely because both are equally loaded. But "failover" I would usually consider a scenario where a second server is purely in wait for the primary to fail, in which case it would be virtually unused. Like an active/passive scenario as someone mentioned below.
But perhaps I got my terminology mixed up. I'm not working with servers so much anymore.
It seems like the perfect set of circumstances to really last. It doesn't have an invasive business model, or investors screaming for ROI either. That's the kind of thing that often leads to the user-hostile changes that so often start the decline into oblivion.
Also, I would imagine it's pretty cheap to host; after all, it's all very simple text. I don't think it hosts any pictures besides the little Y Combinator logo in the corner :)
It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.
It would be even better if they just keep doing it as they are though <3
You know how they say to always test your backups? Always test your failover too.
Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565
First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)
So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.
There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.
I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.
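(For what it's worth, here's a quick back-of-the-envelope check of those figures, using only the timestamps quoted above; the 30-second value is the API refresh interval mentioned, and the script is just an illustration, not anything HN actually runs.)

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps quoted above (UTC)
last_post_before_outage = datetime.strptime("2022-07-08 12:46:04", FMT)
first_post_after_outage = datetime.strptime("2022-07-08 20:30:55", FMT)

downtime = first_post_after_outage - last_post_before_outage
print(downtime)  # 7:44:51 -> roughly 7h 45m of downtime

# HN Search follows the API, which is updated every 30 seconds. If the last
# backup was taken after the last API update, the window of potentially lost
# posts/votes is bounded by one refresh interval.
max_lost_window = 30  # seconds
print(f"upper bound on lost data window: ~{max_lost_window}s")
```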
Were they connected on the same power supply? I had 4 different disks fail at the same time before, but they were all in the same PC... (lightning)
Is your backup system tied to your API? Algolia is a third party service, and streaming the latest HN data to Algolia seems pretty similar to streaming it to a backup system.
And that they were sold by HP or Dell, and manufactured by SanDisk.
Do I win a prize?
(None of us win prizes on this one).
Yes—I'm a bit unclear on what happened there, but that does seem to be the case.
Unbelievable. Thank you for sharing your experience!
Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.
But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.
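(As a rough sanity check on the timing, 40,000 hours of continuous power-on works out to a bit over four and a half years; the assumption of continuous uptime is mine, not something stated here.)

```python
# 40,000 power-on hours converted to calendar time, assuming the drives
# were powered on continuously (an assumption, not a fact from the thread)
hours = 40_000
years = hours / (24 * 365.25)
print(f"{years:.2f} years")  # ~4.56 years
```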
Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.
This thread is making me feel a lot less crazy.
[1] https://www.reddit.com/r/sysadmin/comments/f5k95v/dell_emc_u...
e.g. Simultaneous Engine Maintenance Increases Operating Risks, Aviation Mechanics Bulletin, September–October 1999 https://flightsafety.org/amb/amb_sept_oct99.pdf
Also, you shouldn't wait for disks to fail to replace them. HN's disks were used for 4.5 years, which is longer than the typical disk lifetime in my experience. They should have replaced them sooner, one by one, in anticipation of failure. This would also allow them to stagger their disk purchases to avoid similar manufacturing dates.
Hopefully archive.org is involved in archiving HN, though unfortunately archive.org's future itself is in jeopardy.
A long time ago we had a Dell server with a RAID pre-configured by Dell (don't ask, I didn't order it). Eventually one disk in this server died; what sucked was that the second disk in the RAID array also failed only a few minutes later. We had to restore from backup, which sucked, but to our surprise, when we opened the Dell server, the two disks had sequential serial numbers. They came from the same batch at the same time. Not a good thing to do when you sell people pre-configured RAID systems at a markup...
How so?? This is the first I've heard of it.
I've seen too many dead disks with perfect SMART. When the numbers go down (or up) and triggers fire, then you surely need to replace the disk[0], but SMART without warnings just means nothing.
[0] my desktop ran for years entirely on disks removed from client PCs after a failure. Some of them had pretty bad SMART; on a couple I needed to move the starting point of the partition a couple of GB further from sector 0 (otherwise they would stall pretty soon), but overall they worked fine. Still, I never used them as reliable storage, and I knew I could lose them at any time.
Of course I don't use repurposed drives in the servers.
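(For anyone who would rather watch those numbers than trust a green light, here's a minimal sketch of pulling a drive's SMART self-assessment and attribute table via smartmontools; the device path is a placeholder, it assumes smartctl is installed and the script has privileges to run it, and the caveat above still stands: a clean report doesn't prove the disk is healthy.)

```python
import subprocess

DEVICE = "/dev/sda"  # placeholder; point at the drive you actually care about

def smart_health(device: str) -> str:
    # `smartctl -H` prints the drive's overall SMART self-assessment
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True, check=False)
    return result.stdout

def smart_attributes(device: str) -> str:
    # `smartctl -A` dumps the vendor attribute table (reallocated sectors,
    # power-on hours, wear indicators, etc.)
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    return result.stdout

if __name__ == "__main__":
    print(smart_health(DEVICE))
    print(smart_attributes(DEVICE))
```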
PS: when I tried to post this I received "We're having some trouble serving your request. Sorry!" Sheesh.
Here are some relevant links:
https://news.ycombinator.com/item?id=31703394
https://decrypt.co/31906/activists-rally-save-internet-archi...
https://www.courtlistener.com/docket/17211300/hachette-book-...
I guess it got them some goodwill during Corona but it could cause more damage than it's worth.
I wouldn't have done it; it's not like it provided real value during the pandemic. Those who are really into books and don't care about copyright already know their way to more gray-area sites like LibGen.
HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)
HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)
Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)
HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)
HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)
(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)
This one is just ... maddening.
Then the people under them who do give a shit, because they depend on those servers, aren’t allowed to register with HP etc for updates, or to apply firmware updates, because “separation of duties”.
Basically, IT is cancer from the head down.
The lesson I learned is that the three replacements went to different arrays and we never again let drives from the same batch be part of the same array.
It makes you lose data and need to purchase new hardware; where I come from, that's usually referred to as "planned" or "convenient" obsolescence.
Of course there's no law that says SSD firmware writers can't be rookies.
Both planned and convenient obsolescence are beneficial to device manufacturers. Without proper accountability, it just becomes normal practice.
The manufacturer, obviously. Who else would it be?
Could be an innocent mistake or a deliberate decision. Further action should be predicated on the root cause. Which includes intent.