zlacker

[return to "HN is up again"]
1. sillys+z  2022-07-08 20:34:23
>>tpmx+(OP)
HN was down because the failover server also failed: https://twitter.com/HNStatus/status/1545409429113229312

Double disk failure is improbable but not impossible.

The most impressive thing is that there seems to be almost no data loss whatsoever. Whatever the backup system is, it seems rock solid.

2. davedu+b2  2022-07-08 20:41:05
>>sillys+z
> Double disk failure is improbable but not impossible.

It's not even improbable if the disks are the same kind purchased at the same time.
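
Rough numbers (the failure rate below is purely hypothetical, just to illustrate the point): under independence, two specific drives dying on the same day is a one-in-hundreds-of-millions event; share a batch and a deterministic firmware defect, and it is practically guaranteed.

    # Hypothetical numbers, purely illustrative -- not measurements of any
    # real drive, just to show why "same make, same batch" changes the math.
    annual_failure_rate = 0.02                   # assume ~2% AFR per drive
    p_fail_on_given_day = annual_failure_rate / 365

    # Two *independent* drives both failing on the same given day:
    p_both_independent = p_fail_on_given_day ** 2
    print(f"independent drives, same day: {p_both_independent:.2e}")  # ~3e-09

    # Two drives that share a firmware defect tied to a fixed power-on-hours
    # count, and that were powered on within hours of each other, fail as
    # essentially the same event -- the probability is close to 1, not ~3e-09.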

3. kabdib+iv  2022-07-08 22:34:21
>>davedu+b2
I once had a small fleet of SSDs fail because they had some uptime counters that overflowed after 4.5 years, and that somehow persistently wrecked some internal data structures. It turned them into little, unrecoverable bricks.

It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.
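
A minimal sketch of that failure mode, with invented server names and an assumed threshold: if every drive bricks at the same power-on-hours count, the fleet goes dark in exactly the order it was brought up.

    # Illustrative only: a fleet whose drives all brick at the same
    # power-on-hours threshold (numbers invented for the example).
    BRICK_AT_HOURS = 40_000    # roughly 4.5 years of uptime

    # (server, hours after the first bring-up that it was powered on)
    fleet = [("web-01", 0), ("web-02", 6), ("web-03", 30), ("db-01", 72)]

    for name, brought_up_at in fleet:
        dies_at = brought_up_at + BRICK_AT_HOURS
        print(f"{name}: powered on at t+{brought_up_at}h, bricks at t+{dies_at}h")

    # The servers fail in the same order they were originally powered on,
    # a fixed number of hours after each bring-up -- wear never enters into it.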

4. mikiem+Lb1  2022-07-09 03:05:29
>>kabdib+iv
You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5 years. I am not even kidding.

5. kabdib+md1  2022-07-09 03:20:11
>>mikiem+Lb1
Let me narrow my guess: They hit 4 years, 206 days and 16 hours . . . or 40,000 hours.

And that they were sold by HP or Dell, and manufactured by SanDisk.

Do I win a prize?

(None of us win prizes on this one).
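
For anyone checking the arithmetic, 40,000 hours is exactly 4 years, 206 days and 16 hours when counted in 365-day years:

    # 40,000 power-on hours expressed as years / days / hours (365-day years).
    total_hours = 40_000

    days, hours = divmod(total_hours, 24)   # 1666 days, 16 hours
    years, days = divmod(days, 365)         # 4 years, 206 days

    print(f"{years} years, {days} days, {hours} hours")  # 4 years, 206 days, 16 hours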

6. dang+Kj1  2022-07-09 04:22:15
>>kabdib+md1
Wow. It's possible that you have nailed this.

Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.

But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.

Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.

7. tempes+BH1  2022-07-09 08:34:14
>>dang+Kj1
This kind of thing is why I love Hacker News. Someone runs into a strange technical situation, and someone else happens to share their own obscure, related anecdote, which just happens to precisely solve the mystery. Really cool to see it benefit HN itself this time.

8. dang+me3  2022-07-09 20:29:34
>>tempes+BH1
It's also an example of the dharma of /newest – the rising and falling away of stories that get no attention:

HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)

HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)

Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)

HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)

HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)

(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)

9. dredmo+E64  2022-07-10 06:32:41
>>dang+me3
Popularity is a very poor relevance / truth heuristic.

10. gpshea+hE5  2022-07-10 19:50:58
>>dredmo+E64
I wanted to upvote this comment but that just feels wrong.