Double disk failure is improbable but not impossible.
The most impressive thing is that there seems to be almost no data loss whatsoever. Whatever the backup system is, it seems rock solid.
https://twitter.com/HNStatus/status/1545461511870566400
Disk failure plus fallback server failure. It was definitely a long day for their ops team, on a Friday no less.
Over a year with no issues. Impressive.
____________
Related:
https://www.newyorker.com/news/letter-from-silicon-valley/th...
https://news.ycombinator.com/item?id=32026565
The ones after it are hours later and usually deleted, until this post (...71).
This logs the smaller outages: https://hn.hund.io/
GitHub page for that project: https://github.com/clintonwoo/hackernews-remix-react
https://web.archive.org/web/20220330032426/https://ops.faith...
Primary failure: https://news.ycombinator.com/item?id=32024036
Standby failure: https://twitter.com/HNStatus/status/1545409429113229312
Each server has a pair of mirrored disks, so it seems we're talking about 4 drives failing, not just 2.
On the other hand, the primary seems to have gone down 6 hours before the backup server did, so the failures weren't quite simultaneous.
But even comparing apples to oranges, the HN status page someone else pointed out (https://hn.hund.io/) seems to show that HN has had more than one outage in just the past month. All but today's and last night's were quite short, but still. Sometimes you need some extra complexity if you want to get to zero downtime overall.
That's not something the HN website needs, but I think AWS is doing fine even if that's your point of comparison.
https://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashi...
Is it more appropriate to call the strategy in this case fallback, or failover?[0] Since the secondary server wasn't running in production until the first one failed, it sounds like fallback?
Perhaps a higher-reliability strategy would have been, instead of having a secondary server, to just put more mirrored disks in the main server, reducing the likelihood of the array being compromised?
Alternatively, they could run both the primary and secondary servers in production all the time. But that would presumably merely move the single point of failure to the proxy?
[0] https://aws.amazon.com/builders-library/avoiding-fallback-in...
Edit - HN is on AWS now. https://news.ycombinator.com/item?id=32026571
https://check-host.net/ip-info?host=https://news.ycombinator...
https://search.arin.net/rdap/?query=50.112.136.166
Note: HN has been on M5 hosting for years, and they were still there as of 16 hours ago, per dang:
https://news.ycombinator.com/item?id=32024105
During the outage, I listed places to check HN-related systems and posted them here:
My solution: a 3-hour focus mode browser extension.
1. Install the BlockSite Chrome extension [1].
2. In BlockSite settings, add HN, Twitter, and any other distracting sites to the Focus Mode list and set the Focus Mode time to 3 hours.
3. Ensure you uninstall all social media apps from your phone.
4. When I find myself opening a new tab and typing "n" to get a dopamine hit, I then turn on my 3-hour focus mode.
[1] https://chrome.google.com/webstore/detail/blocksite-block-we...
In all seriousness, at least 2/3rds of the complexity is because of your choice of tools and approach. Terraform alone makes things significantly more complex. If you just want to trigger a deployment, then a Template Spec made from a Bicep file could be banged out in like... an hour.[1]
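To make that concrete, here's a rough sketch of what the source of such a Template Spec could look like. The storage account is only a placeholder (I don't know what resources you're actually deploying), and the resource group and spec names in the comments are hypothetical:

  // main.bicep - placeholder template to publish as a Template Spec.
  // Publish it once (if your CLI version doesn't accept .bicep here, run `az bicep build` first):
  //   az ts create --resource-group automation-rg --name my-spec --version 1.0 --location westeurope --template-file main.bicep
  // Then every deployment is a single call against the spec version's resource ID:
  //   az deployment group create --resource-group target-rg --template-spec <template-spec-version-resource-id>
  param storageAccountName string
  param location string = resourceGroup().location

  resource storage 'Microsoft.Storage/storageAccounts@2022-09-01' = {
    name: storageAccountName
    location: location
    sku: {
      name: 'Standard_LRS'
    }
    kind: 'StorageV2'
  }

The point being: the thing you publish is an ordinary Bicep file, and "triggering a deployment" reduces to one CLI call referencing the spec.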
When in Rome, do as the Romans do. You basically took a Microsoft product and tried to automate it with a bunch of Linux-native tools. Why would you think this would be smooth and efficient?
Have you ever tried automating Linux with VB Script? This is almost the same thing.
[1] Someone had a similar example here using Logic Apps and a Template Spec: https://cloudjourney.medium.com/azure-template-spec-and-logi...
Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565
First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)
So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.
There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.
I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.
[1] https://twitter.com/HNStatus
[2] https://www.reuters.com/business/media-telecom/rogers-commun...
[1] https://www.reddit.com/r/sysadmin/comments/f5k95v/dell_emc_u...
Tell HN: HN Moved from M5 to AWS - https://news.ycombinator.com/item?id=32030400 - July 2022 (116 comments)
Ask HN: What'd you do while HN was down? - https://news.ycombinator.com/item?id=32026639 - July 2022 (218 comments)
HN is up again - https://news.ycombinator.com/item?id=32026571 - July 2022 (314 comments)
I'd particularly look here: https://news.ycombinator.com/item?id=32026606 and here: https://news.ycombinator.com/item?id=32031025.
If you scroll through my comments from today via https://news.ycombinator.com/comments?id=dang&next=32039936, there are additional details. (Sorry for recommending my own comments.)
If you (or anyone) skim through that stuff and have a question that isn't answered there, I'd be happy to take a crack at it.
e.g. Simultaneous Engine Maintenance Increases Operating Risks, Aviation Mechanics Bulletin, September–October 1999 https://flightsafety.org/amb/amb_sept_oct99.pdf
I've seen too many dead disks with perfect SMART data. When the numbers go down (or up) and the triggers fire, you surely need to replace the disk[0], but SMART without warnings just means nothing.
[0] My desktop ran for years entirely on disks removed from client PCs after a failure. Some of them had pretty bad SMART data; on a couple I needed to move the starting point of the partition a couple of GB further from sector 0 (otherwise they would stall pretty quickly), but overall they worked fine. Still, I never used them as reliable storage, and I knew I could lose them at any time.
Of course I don't use repurposed drives in the servers.
PS: And when I tried to post this, I received "We're having some trouble serving your request. Sorry!" Sheesh.
Here are some relevant links:
https://news.ycombinator.com/item?id=31703394
https://decrypt.co/31906/activists-rally-save-internet-archi...
https://www.courtlistener.com/docket/17211300/hachette-book-...
I normally bill for cloud automation advice, but the gist is:
You can automate RBAC/IAM via Bicep or ARM[1], but only for existing groups, system-assigned managed identities, or user-assigned managed identities. This covers everything that is typically done for cloud automation.
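As a minimal sketch of what [1] looks like in practice (the group's object ID parameter and the choice of the Contributor role here are illustrative, not from the original setup):

  // Assign a built-in role to an existing AAD group. The group is NOT created
  // here; its object ID comes in as a parameter (the one-time prerequisite).
  param groupObjectId string

  // Well-known ID of the built-in Contributor role.
  var contributorRoleId = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b24988ac-6180-42a0-ab88-20f7382dd24c')

  resource contributorAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
    // Role assignment names must be GUIDs; a deterministic guid() keeps redeployments idempotent.
    name: guid(resourceGroup().id, groupObjectId, contributorRoleId)
    properties: {
      principalId: groupObjectId
      roleDefinitionId: contributorRoleId
      principalType: 'Group'
    }
  }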
Note that the initial setup might require "manual" steps to set up the groups and their memberships, but then the rest can be automated. In other words, there's a one-time "prerequisites" step followed by 'n' fully automated deployments.
You can also use templates to deploy groups dynamically[2] if you really need to, but this ought to be rare. The problem with this is that templates are designed to deploy resources, and AAD groups aren't resources.
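If you genuinely do need a group created as part of a deployment, [2] amounts to running a CLI script from inside the template. A sketch, assuming a pre-existing user-assigned identity that already has rights to create groups (the identity parameter and group name are hypothetical, and that prerequisite is exactly what makes this route awkward):

  // Create an AAD group via a deployment script (use sparingly).
  param scriptIdentityId string   // resource ID of an existing user-assigned managed identity
  param groupName string          // sketch only: a real mail nickname can't contain spaces

  resource createGroup 'Microsoft.Resources/deploymentScripts@2020-10-01' = {
    name: 'create-aad-group'
    location: resourceGroup().location
    kind: 'AzureCLI'
    identity: {
      type: 'UserAssigned'
      userAssignedIdentities: {
        '${scriptIdentityId}': {}
      }
    }
    properties: {
      azCliVersion: '2.40.0'
      scriptContent: 'az ad group create --display-name "${groupName}" --mail-nickname "${groupName}"'
      retentionInterval: 'P1D'
      cleanupPreference: 'OnSuccess'
    }
  }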
More generally, your mistake IMHO was to try to automate the automation itself, while side-stepping the Azure-native automation tooling by choosing Terraform+Functions instead of Template Specs with delegated permissions via Azure RBAC. Most of your template is used to deploy the infrastructure to deploy a relatively simple template!
This reminds me of people writing VB Scripts to generate CMD files that generate VB Scripts to trigger more scripts in turn. I wish I were kidding, but a huge enterprise did this seven levels deep for a critical systems-management process. It broke and caused massive problems. Don't do this; just KISS and remember https://xkcd.com/1205/
[1] via Microsoft.Authorization/roleAssignments
[2] via Microsoft.Resources/deploymentScripts
HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)
HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)
Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)
HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)
HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)
(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)
> I normally bill for cloud automation advice, but the gist is
Can you please omit supercilious swipes from your comments here? Everybody knows different things. If you know more than someone else about $thing, that's great—but please don't put them down for it. That's not in the spirit of kindness and curious conversation that we're hoping for here.
https://news.ycombinator.com/newsguidelines.html
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...