I definitely spent an unreasonable amount of time thinking my internet had a problem when trying to open HN, since it's always just been so constant.
Double disk failure is improbable but not impossible.
The most impressive thing is that there seems to be almost no data loss whatsoever. Whatever the backup system is, it seems rock solid.
https://twitter.com/HNStatus/status/1545461511870566400
Disk and fallback server failure. Was definitely a long day for their ops team, on a Friday no less.
Needless to say I opened a new tab, typed "n", and hit enter countless times today before my brain caught up with my muscle memory.
I realized how little of this I find elsewhere in my life - whether through Reddit or even my IRL friend circles.
This realization saddens me - I feel like I shouldn’t have to rely on HN so much to scratch this particular itch.
Perhaps I need to get out more.
I know old posts indicate it's running on low-core-count but high-frequency Intel CPUs on FreeBSD, with no database (just flat files).
I wonder if it’s still the same.
Anyone been on Slashdot lately? Checked it out too; it was really nice.
Thank you to everyone who keeps this thing running.
Also I remember the "Why we're going with Rails" story on the front page from before it went down.
> Perhaps I need to get out more.
Another way to look at it is that you have a particular set of interests and HN is the online outlet that serves those interests. There's nothing wrong with that at all, and you don't need multiple sources for it. No different from someone who likes to ride bikes owning one bike, or someone who likes to read going to the same local library every week for 10 years.
Couldn't possibly have been HN that was the problem haha
Over a year with no issues. Impressive.
It's not even improbable if the disks are the same kind purchased at the same time.
EDIT: My response was based on some edits that are now removed.
____________
Related:
https://www.newyorker.com/news/letter-from-silicon-valley/th...
Good news for people who were banned, or for posts that didn't get enough momentum :)
edit: Was restored from backup, so there was definitely some data loss.
Not that I deserve or expect one from a free service, but I enjoy reading postmortems of failures where both the primary and backup systems failed; I like to see what holes I might have in my own failover setup.
https://news.ycombinator.com/item?id=32026565
The ones after it are hours later and usually deleted, until this post (...71).
I reset the router... and HN was still down.
<sniff>
... I might have been more productive than usual today.
Whereas, if HN closes, there is no equivalent replacement available.
If the server went down at XX:XX, and the backup they restored from is also from XX:XX, there isn't data loss. If the server was down for 8 hours, the last data being 8 hours old isn't data loss; it's correct.
Does that mean nothing of value was lost?
The latter is understandable; the former would be quite a surprise for such a popular site. It would mean that the machines have no disk redundancy and that the server goes down immediately on a disk failure, leaving the fallback server as the only backup.
It's actually surprisingly common for failover hardware to fail shortly after the primary hardware. It's normally been exposed to similar conditions to what killed the primary and the strain of failing over pushes it over the edge.
This logs lesser ones: https://hn.hund.io/
But I'm too lazy to write the application. I wish there were some SDK I could spin up, like phpBB back in the day, to have something exactly like HN.
% host news.ycombinator.com
news.ycombinator.com has address 50.112.136.166
and also interesting: DNS TTL is set to 1 (one).
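If anyone wants to verify, dig prints the TTL in the second column of the answer section (a rough sketch; a caching resolver may show a lower remaining value):

    % dig +noall +answer news.ycombinator.com
    news.ycombinator.com.    1    IN    A    50.112.136.166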
Still, I see no reason for prioritizing that failure mode on a site like HN.
I guess proper redundancy also means having different brands of equipment, in some cases.
In the early 00s, when Google went offline I wouldn't believe it and would go check my connection (even if I was fetching other sites at the same time). It looks like nowadays HN is in that place.
Having a RAID5 array crash and burn because a second disk failed during the rebuild that followed the first disk's failure is a common story.
GitHub page for that project: https://github.com/clintonwoo/hackernews-remix-react
8 hours of downtime in a given year is 99.9% availability, so only three nines. The major SaaS platforms are all basically at least as resilient as this, and most have more stringent SLAs.
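For reference, here's the back-of-the-envelope arithmetic behind that figure (a quick check with bc, assuming 8,760 hours in a non-leap year):

    % echo "scale=6; 100*(8760-8)/8760" | bc
    99.908675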
https://web.archive.org/web/20220330032426/https://ops.faith...
NetRange: 50.112.0.0 - 50.112.255.255
CIDR: 50.112.0.0/16
NetName: AMAZON-EC2-USWESTOR
NetHandle: NET-50-112-0-0-1
Parent: NET50 (NET-50-0-0-0-0)
NetType: Direct Allocation
OriginAS: AS14618
Organization: Amazon.com, Inc. (AMAZO-47)

(Thankfully, they didn't completely die but just put themselves into read-only)
It feels like we've lost a lot of that observability and immediacy with the cloud. It's not as easy to quickly understand the larger picture. You can understand the state of various services with the web console or command line tools but tracing a path through those services is much less obvious and efficient.
I'm kind of nervous to even discuss this, as I wonder if it's just my age showing, especially since I see very few people mention this as one of the downsides of various cloud solutions. Maybe I'm just jaded?
However, it takes money and time to keep it around in a not-for-profit way, so it will be an institution only as long as its funding stays the same.
Thanks Dang and company.
I appreciate you all.
@my HN peers:
Have a great weekend and thank you all for being you. I learn a ton here and enjoy the perspectives often found on these pages. It is all high value.
HN is running on an old laptop from Viaweb.
Arc is running under the pg user and it's used as the process supervisor.
The actual web server is a VB app running on Linux through Wine.
The flat files have been migrated to an MS Access DB, also running through Wine.
Early on /. was amazing! Remember CmdrTaco working it all out, often taking us along for the ride?
Good times, frequently good discussion.
HN has been better for years now, and was better even at its inception, for the most part.
/. has improved a bit. Good to see, or I caught it on a good day.
It's not always easy, but if you can, you want manufacturer diversity, batch diversity, maybe firmware version diversity[1], and power on time diversity. That adds a lot of variables if you need to track down issues though.
[1] you don't want to have versions with known issues that affect you, but it's helpful to have different versions to diagnose unknown issues.
I was worried that I may actually have to go out and do things instead of lurking here this weekend..
Primary failure: https://news.ycombinator.com/item?id=32024036 Standby failure: https://twitter.com/HNStatus/status/1545409429113229312
WELL TODAY WAS VERY INCONVENIENT LET ME TELL YOU! :)
Each server has a pair of mirrored disks, so it seems we're talking about 4 drives failing, not just 2.
On the other hand the primary seems to have gone down 6 hours before the backup server did, so the failures weren't quite simultaneous.
But even comparing apples to oranges, the HN status page someone else pointed out (https://hn.hund.io/) seems to show that HN has had more than one outage in just the past month. All but today's and last night's were quite short, but still. Sometimes you need some extra complexity if you want to get to zero downtime overall.
That's not something the HN website needs but I think AWS is doing fine even if that's your point of comparison.
That said, HN does have quality content and the signal/noise is way better than sites designed specifically to keep you addicted.
I do this too, and it's because this site is an addictive slot machine just like every other social networking site. I actually really hate this website, but I'm here almost every day, because I can't seem to break the habit. Neat. It's probably because I have a common impulse control / executive functioning disorder, and the way the front page works exploits some bug in my brain.
Reddit does this to me too. I also hate Reddit.
I like wasting time on HN because it's time not actually wasted :)
And don't get me started on Twitter... Sure there are some gems on twitter but I have to wade through 1000s of tweets of pure nonsense to see them. No thanks. If it's something really great someone will post a link on HN anyway :)
Not doing it for this reason but rather for financial ones :) But as I have a totally mixed bunch of disk sizes, I have no RAID, and a disk loss would be horrible.
https://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashi...
For load balancing I would consider this very likely because both are equally loaded. But "failover" I would usually consider a scenario where a second server is purely in wait for the primary to fail, in which case it would be virtually unused. Like an active/passive scenario as someone mentioned below.
But perhaps I got my terminology mixed up. I'm not working with servers so much anymore.
Is it more appropriate to call the strategy in this case fallback, or failover? Since the secondary server wasn't running in production until the first one failed, it sounds like fallback?
Perhaps a higher-reliability strategy would have been, instead of having a secondary server, to just have more mirrored disks on the main server, to reduce the likelihood of the array being compromised?
Alternatively, to run both the primary and secondary servers in production all the time. But that would presumably merely move the single point of failure to the proxy?
[0] https://aws.amazon.com/builders-library/avoiding-fallback-in...
When I saw hn was down, I double-checked the news to see if a major part of the internet had gone down.
It seems the perfect circumstances to really last. It doesn't have an invasive business model, or investors screaming for ROI either. That's the kind of thing that often leads to user-hostile changes that so often start the decline into oblivion.
Also, I would imagine it's pretty cheap to host; after all, it's all very simple text. I don't think it hosts any pictures besides the little Y Combinator logo in the corner :)
It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.
It would be even better if they just keep doing it as they are though <3
I'd argue that this site has a good signal/noise ratio by design and specifically to keep you addicted (where "addicted" means using and constantly returning to the site). This site is just designed to attract people who are put off by the kinds of tricks employed elsewhere.
It's great that they were able to spin it up in the cloud for recovery purposes. But it's more legendary on a real server <3
Yes I'm old :P
Why hate this site? Because it contains interesting/useful content often enough to make you come back? That'd be a weird reason to hate the site. I too have a common impulse control/executive functioning disorder, but I don't hate the things that it makes me vulnerable to. If I were feeling resentful, I'd have to put the blame on my condition.
I don't have to ask why you hate Reddit; the valid reasons for hating Reddit are myriad.
It's a pretty universal issue. Companies are just getting better at using it to their advantage.
True appreciation to the team who works to keep it up, high-quality, and impactful.
You know how they say to always test your backups? Always test your failover too.
Or so we tell ourselves.
Wasn't Cloudflare down for a few hours recently? Cloud providers don't magically fix outages...
Would you please stop spamming your opinion? You wrote it once; that's enough.
Thank you very much <3
Edit - HN is on AWS now. https://news.ycombinator.com/item?id=32026571
https://check-host.net/ip-info?host=https://news.ycombinator...
https://search.arin.net/rdap/?query=50.112.136.166
Note: HN has been on M5 Hosting for years, and they were still there as of 16 hours ago per Dang:
https://news.ycombinator.com/item?id=32024105
During the outage, I listed places to check HN-related systems and posted them here:
I'm sorry, this kind of thing reeks of point "whoring" to me, and I consider that to be an indefensible thing to do; it's pollution. We can see the site. We know it's up. We don't need to be told. Stop doing things purely to increase your score. This isn't a game. Etc.
My solution: a 3-hour focus mode browser extension.
1. Install the BlockSite chrome extension [1].
2. In BlockSite settings, add HN, Twitter, and any other distracting sites to the Focus Mode list, and set the Focus Mode time to 3 hours.
3. Ensure you uninstall all social media apps from your phone.
4. When I find myself opening a new tab and typing "n" to get a dopamine hit, I then turn on my 3-hour focus mode.
[1] https://chrome.google.com/webstore/detail/blocksite-block-we...
In all seriousness, at least 2/3rds of the complexity is because of your choice of tools and approach. Terraform alone makes things significantly more complex. If you just want to trigger a deployment, then a Template Spec made from a Bicep file could be banged out in like... an hour.[1]
When in Rome, do as the Romans do. You basically took a Microsoft product and tried to automate it with a bunch of Linux-native tools. Why would you think this would be smooth and efficient?
Have you ever tried automating Linux with VB Script? This is almost the same thing.
[1] Someone had a similar example here using Logic Apps and a Template Spec: https://cloudjourney.medium.com/azure-template-spec-and-logi...
Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565
First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)
So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.
There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.
I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.
And talk is cheap. I dare you to write a blog post or make a public GitHub repo doing the equivalent work (see the Goals section) with your own tools. If you can, I'll be super impressed (not that my admiration is worth anything).
One thing you'll run into is that AD roles and other authn aren't accessible via ARM templates/Bicep.
Definitely thought the same. Then I realized that I'm browsing through the work VPN and had a second thought: what if our admins decided to fight procrastination?
With that said, the comments are the most addictive part of this site.
Were they connected on the same power supply? I had 4 different disks fail at the same time before, but they were all in the same PC... (lightning)
Is your backup system tied to your API? Algolia is a third party service, and streaming the latest HN data to Algolia seems pretty similar to streaming it to a backup system.
And that they were sold by HP or Dell, and manufactured by SanDisk.
Do I win a prize?
(None of us win prizes on this one).
Yes—I'm a bit unclear on what happened there, but that does seem to be the case.
[1] https://twitter.com/HNStatus
[2] https://www.reuters.com/business/media-telecom/rogers-commun...
Plus, it's hard to quantify many cases because there is hard-down and soft-down (partial interruptions).
Unbelievable. Thank you for sharing your experience!
Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.
But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.
Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.
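For scale, 40,000 power-on hours is a bit over four and a half years of continuous uptime, which lines up with the roughly 4.5-year service life mentioned elsewhere in the thread (a quick check with bc):

    % echo "scale=2; 40000/24/365.25" | bc
    4.56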
This thread is making me feel a lot less crazy.
I think I’m just realizing I need to find that in more places, regardless of topic focus.
[1] https://www.reddit.com/r/sysadmin/comments/f5k95v/dell_emc_u...
Tell HN: HN Moved from M5 to AWS - https://news.ycombinator.com/item?id=32030400 - July 2022 (116 comments)
Ask HN: What'd you do while HN was down? - https://news.ycombinator.com/item?id=32026639 - July 2022 (218 comments)
HN is up again - https://news.ycombinator.com/item?id=32026571 - July 2022 (314 comments)
I'd particularly look here: https://news.ycombinator.com/item?id=32026606 and here: https://news.ycombinator.com/item?id=32031025.
If you scroll through my comments from today via https://news.ycombinator.com/comments?id=dang&next=32039936, there are additional details. (Sorry for recommending my own comments.)
If you (or anyone) skim through that stuff and have a question that isn't answered there, I'd be happy to take a crack at it.
e.g. Simultaneous Engine Maintenance Increases Operating Risks, Aviation Mechanics Bulletin, September–October 1999 https://flightsafety.org/amb/amb_sept_oct99.pdf
Also, you shouldn't wait for disks to fail to replace them. HN's disks were used for 4.5 years, which is greater than the typical disk lifetime, in my experience. They should have replaced them sooner, one by one, in anticipation of failure. This would also allow them to stagger their disk purchases to avoid similar manufacturing dates.
Hopefully archive.org is involved in archiving HN, though unfortunately archive.org's future itself is in jeopardy.
A long time ago we had a Dell server that came with a pre-configured RAID from Dell (don't ask, I didn't order it). Eventually one disk in this server died; what sucked was that the second disk in the RAID array also failed only a few minutes later. We had to restore from backup, which sucked, but to our surprise, when we opened the Dell server, the two disks had sequential serial numbers. They came from the same batch at the same time. Not a good thing to do when you sell people pre-configured RAID systems at a markup...
How so?? This is the first I've heard of it.
I've seen too many dead disks with perfect SMART. When the numbers go down (or up) and triggers fire, then you surely need to replace the disk[0], but SMART without warnings just means nothing.
[0] My desktop ran for years entirely on disks removed from client PCs after a failure. Some of them had pretty bad SMART; on a couple I needed to move the starting point of the partition a couple of GBs further from sector 0 (otherwise they would stall pretty soon), but overall they worked fine. Still, I never used them as reliable storage, and I knew I could lose them at any time.
Of course I don't use repurposed drives in the servers.
PS: when I tried to post this, I received "We're having some trouble serving your request. Sorry!" Sheesh.
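For anyone who wants to look at those numbers on their own drives, smartctl from smartmontools dumps them (a minimal sketch; the device paths are just examples):

    % sudo smartctl -a /dev/sda       # full report: health status, vendor attributes, error log
    % sudo smartctl -A /dev/nvme0     # attributes/health table only; NVMe drives report a different set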
Here are some relevant links:
https://news.ycombinator.com/item?id=31703394
https://decrypt.co/31906/activists-rally-save-internet-archi...
https://www.courtlistener.com/docket/17211300/hachette-book-...
I normally bill for cloud automation advice, but the gist is:
You can automate RBAC/IAM via Bicep or ARM[1], but only for existing groups or system managed identities or user managed identities. This usually covers everything that is typically done for cloud automation.
Note that the initial setup might require "manual" steps to set up the groups and their memberships, but then the rest can be automated. In other words, there's a one-time "prerequisites" step followed by 'n' fully automated deployments.
You can also use templates to deploy groups dynamically[2] if you really need to, but this ought to be rare. The problem with this is that templates are designed to deploy resources, and AAD groups aren't resources.
More generally, your mistake IMHO was to try to automate the automation itself, while side-stepping the Azure-native automation tooling by choosing Terraform+Functions instead of Template Specs with delegated permissions via Azure RBAC. Most of your template is used to deploy the infrastructure to deploy a relatively simple template!
This reminds me of people writing VB Scripts to generate CMD files that generate VB Scripts that trigger more scripts in turn. I wish I was kidding, but a huge enterprise did this seven levels deep for a critical systems-management process. It broke and caused massive problems. Don't do this; just KISS and remember https://xkcd.com/1205/
[1] via Microsoft.Authorization/roleAssignments
[2] via Microsoft.Resources/deploymentScripts
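To make the "one-time prerequisites, then fully automated deployments" split concrete, here's a minimal sketch using the Azure CLI. The group name, resource group, template file, and parameter name are made up for illustration, and on older CLI versions the group's id field is exposed as objectId rather than id:

    # One-time prerequisite (outside the template): create the AAD group.
    az ad group create --display-name "app-operators" --mail-nickname "app-operators"

    # Repeatable deployment: pass the existing group's object id into the Bicep template,
    # which assigns roles to it via Microsoft.Authorization/roleAssignments.
    GROUP_ID=$(az ad group show --group "app-operators" --query id -o tsv)
    az deployment group create \
      --resource-group my-rg \
      --template-file main.bicep \
      --parameters operatorsGroupObjectId="$GROUP_ID"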
I guess it got them some goodwill during Corona but it could cause more damage than it's worth.
I wouldn't have done it; it's not as if it provided real value during the pandemic. Those who are really into books and don't care about copyright already know their way to more gray-area sites like LibGen.
HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)
HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)
Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)
HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)
HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)
(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)
> I normally bill for cloud automation advice, but the gist is
Can you please omit supercilious swipes from your comments here? Everybody knows different things. If you know more than someone else about $thing, that's great—but please don't put them down for it. That's not in the spirit of kindness and curious conversation that we're hoping for here.
https://news.ycombinator.com/newsguidelines.html
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
This one is just ... maddening.
Then the people under them who do give a shit, because they depend on those servers, aren’t allowed to register with HP etc for updates, or to apply firmware updates, because “separation of duties”.
Basically, IT is cancer from the head down.
The lesson I learned is that the three replacements went to different arrays and we never again let drives from the same batch be part of the same array.
It makes you lose data and forces you to purchase new hardware; where I come from, that's usually referred to as "planned" or "convenient" obsolescence.
Of course there's no law that says SSD firmware writers can't be rookies.
Both planned and convenient obsolescence are beneficial to device manufacturers. Without proper accountability for that, it only becomes a normal practice.
The manufacturer, obviously. Who else would it be?
Could be an innocent mistake or a deliberate decision. Further action should be predicated on the root cause. Which includes intent.
Perfectly acceptable.