Double disk failure is improbable but not impossible.
The most impressive thing is that there seems to be almost no data loss whatsoever. Whatever the backup system is, it seems rock solid.
https://twitter.com/HNStatus/status/1545461511870566400
Disk failure plus fallback server failure. It was definitely a long day for their ops team, on a Friday no less.
Over a year with no issues. Impressive.
____________
Related:
https://www.newyorker.com/news/letter-from-silicon-valley/th...
https://news.ycombinator.com/item?id=32026565
The ones after it are hours later and usually deleted, until this post (...71).
This logs the smaller outages: https://hn.hund.io/
GitHub page for that project: https://github.com/clintonwoo/hackernews-remix-react
https://web.archive.org/web/20220330032426/https://ops.faith...
Primary failure: https://news.ycombinator.com/item?id=32024036
Standby failure: https://twitter.com/HNStatus/status/1545409429113229312
Each server has a pair of mirrored disks, so it seems we're talking about 4 drives failing, not just 2.
On the other hand, the primary seems to have gone down 6 hours before the backup server did, so the failures weren't quite simultaneous.
But even comparing apples to oranges, the HN status page someone else pointed out (https://hn.hund.io/) seems to show that HN has had more than one outage in just the past month. All but today's and last night's were quite short, but still. Sometimes you need some extra complexity if you want to get to zero downtime overall.
That's not something the HN website needs, but I think AWS is doing fine even if that's your point of comparison.
https://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashi...
Is it more appropriate to call the strategy in this case fallback, or failover?[0] Since the secondary server wasn't running in production until the first one failed, it sounds like fallback?
Perhaps a higher-reliability strategy would have been, instead of having a secondary server, to just put more mirrored disks in the main server, reducing the likelihood of the array being compromised?
Alternatively, they could run both the primary and secondary servers in production all the time. But that would presumably merely move the single point of failure to the proxy?
[0] https://aws.amazon.com/builders-library/avoiding-fallback-in...
Edit - HN is on AWS now. https://news.ycombinator.com/item?id=32026571
https://check-host.net/ip-info?host=https://news.ycombinator...
https://search.arin.net/rdap/?query=50.112.136.166
Note: HN has been on M5 hosting for years, and they were still there as of 16 hours ago, per dang:
https://news.ycombinator.com/item?id=32024105
During the outage, I listed places to check HN-related systems and posted them here:
My solution: a 3-hour focus mode browser extension.
1. Install the BlockSite Chrome extension [1].
2. In BlockSite settings, add HN, Twitter, and any other distracting sites to the Focus Mode list and set the Focus Mode time to 3 hours.
3. Ensure you uninstall all social media apps from your phone.
4. When I find myself opening a new tab and typing "n" to get a dopamine hit, I then turn on my 3-hour focus mode.
[1] https://chrome.google.com/webstore/detail/blocksite-block-we...
In all seriousness, at least 2/3rds of the complexity is because of your choice of tools and approach. Terraform alone makes things significantly more complex. If you just want to trigger a deployment, then a Template Spec made from a Bicep file could be banged out in like... an hour.[1]
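To make that concrete, here's a rough sketch of what the source of such a Template Spec could look like. The storage account is only a placeholder (I don't know what resources you're actually deploying), and the resource group and spec names in the comments are hypothetical:

  // main.bicep - placeholder template to publish as a Template Spec.
  // Publish it once (if your CLI version doesn't accept .bicep here, run `az bicep build` first):
  //   az ts create --resource-group automation-rg --name my-spec --version 1.0 --location westeurope --template-file main.bicep
  // Then every deployment is a single call against the spec version's resource ID:
  //   az deployment group create --resource-group target-rg --template-spec <template-spec-version-resource-id>
  param storageAccountName string
  param location string = resourceGroup().location

  resource storage 'Microsoft.Storage/storageAccounts@2022-09-01' = {
    name: storageAccountName
    location: location
    sku: {
      name: 'Standard_LRS'
    }
    kind: 'StorageV2'
  }

The point being: the thing you publish is an ordinary Bicep file, and "triggering a deployment" reduces to one CLI call referencing the spec.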
When in Rome, do as the Romans do. You basically took a Microsoft product and tried to automate it with a bunch of Linux-native tools. Why would you think this would be smooth and efficient?
Have you ever tried automating Linux with VB Script? This is almost the same thing.
[1] Someone had a similar example here using Logic Apps and a Template Spec: https://cloudjourney.medium.com/azure-template-spec-and-logi...
Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565
First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)
So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.
There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.
I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.
[1] https://twitter.com/HNStatus
[2] https://www.reuters.com/business/media-telecom/rogers-commun...
[1] https://www.reddit.com/r/sysadmin/comments/f5k95v/dell_emc_u...
Tell HN: HN Moved from M5 to AWS - https://news.ycombinator.com/item?id=32030400 - July 2022 (116 comments)
Ask HN: What'd you do while HN was down? - https://news.ycombinator.com/item?id=32026639 - July 2022 (218 comments)
HN is up again - https://news.ycombinator.com/item?id=32026571 - July 2022 (314 comments)
I'd particularly look here: https://news.ycombinator.com/item?id=32026606 and here: https://news.ycombinator.com/item?id=32031025.
If you scroll through my comments from today via https://news.ycombinator.com/comments?id=dang&next=32039936, there are additional details. (Sorry for recommending my own comments.)
If you (or anyone) skim through that stuff and have a question that isn't answered there, I'd be happy to take a crack at it.
e.g. Simultaneous Engine Maintenance Increases Operating Risks, Aviation Mechanics Bulletin, September–October 1999 https://flightsafety.org/amb/amb_sept_oct99.pdf
I've seen too many dead disks with perfect SMART data. When the numbers go down (or up) and the triggers fire, you surely need to replace the disk[0], but SMART without warnings just means nothing.
[0] My desktop ran for years entirely on disks removed from client PCs after a failure. Some of them had pretty bad SMART data; on a couple I needed to move the starting point of the partition a couple of GB further from sector 0 (otherwise they would stall pretty quickly), but overall they worked fine. Still, I never used them as reliable storage, and I knew I could lose them at any time.
Of course I don't use repurposed drives in the servers.
PS: And when I tried to post this, I received "We're having some trouble serving your request. Sorry!" Sheesh.
Here are some relevant links:
https://news.ycombinator.com/item?id=31703394
https://decrypt.co/31906/activists-rally-save-internet-archi...
https://www.courtlistener.com/docket/17211300/hachette-book-...
I normally bill for cloud automation advice, but the gist is:
You can automate RBAC/IAM via Bicep or ARM[1], but only for existing groups, system-assigned managed identities, or user-assigned managed identities. This covers everything that is typically done for cloud automation.
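As a minimal sketch of what [1] looks like in practice (the group's object ID parameter and the choice of the Contributor role here are illustrative, not from the original setup):

  // Assign a built-in role to an existing AAD group. The group is NOT created
  // here; its object ID comes in as a parameter (the one-time prerequisite).
  param groupObjectId string

  // Well-known ID of the built-in Contributor role.
  var contributorRoleId = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b24988ac-6180-42a0-ab88-20f7382dd24c')

  resource contributorAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
    // Role assignment names must be GUIDs; a deterministic guid() keeps redeployments idempotent.
    name: guid(resourceGroup().id, groupObjectId, contributorRoleId)
    properties: {
      principalId: groupObjectId
      roleDefinitionId: contributorRoleId
      principalType: 'Group'
    }
  }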
Note that the initial setup might require "manual" steps to set up the groups and their memberships, but then the rest can be automated. In other words, there's a one-time "prerequisites" step followed by 'n' fully automated deployments.
You can also use templates to deploy groups dynamically[2] if you really need to, but this ought to be rare. The problem with this is that templates are designed to deploy resources, and AAD groups aren't resources.
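If you genuinely do need a group created as part of a deployment, [2] amounts to running a CLI script from inside the template. A sketch, assuming a pre-existing user-assigned identity that already has rights to create groups (the identity parameter and group name are hypothetical, and that prerequisite is exactly what makes this route awkward):

  // Create an AAD group via a deployment script (use sparingly).
  param scriptIdentityId string   // resource ID of an existing user-assigned managed identity
  param groupName string          // sketch only: a real mail nickname can't contain spaces

  resource createGroup 'Microsoft.Resources/deploymentScripts@2020-10-01' = {
    name: 'create-aad-group'
    location: resourceGroup().location
    kind: 'AzureCLI'
    identity: {
      type: 'UserAssigned'
      userAssignedIdentities: {
        '${scriptIdentityId}': {}
      }
    }
    properties: {
      azCliVersion: '2.40.0'
      scriptContent: 'az ad group create --display-name "${groupName}" --mail-nickname "${groupName}"'
      retentionInterval: 'P1D'
      cleanupPreference: 'OnSuccess'
    }
  }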
More generally, your mistake IMHO was to try to automate the automation itself, while side-stepping the Azure-native automation tooling by choosing Terraform+Functions instead of Template Specs with delegated permissions via Azure RBAC. Most of your template is used to deploy the infrastructure to deploy a relatively simple template!
This reminds me of people writing VB Scripts to generate CMD files that generate VB Scripts to trigger more scripts in turn. I wish I were kidding, but a huge enterprise did this seven levels deep for a critical systems-management process. It broke and caused massive problems. Don't do this; just KISS and remember https://xkcd.com/1205/
[1] via Microsoft.Authorization/roleAssignments
[2] via Microsoft.Resources/deploymentScripts
HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)
HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)
Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)
HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)
HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)
(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)
> I normally bill for cloud automation advice, but the gist is
Can you please omit supercilious swipes from your comments here? Everybody knows different things. If you know more than someone else about $thing, that's great—but please don't put them down for it. That's not in the spirit of kindness and curious conversation that we're hoping for here.
https://news.ycombinator.com/newsguidelines.html
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...