What’s more concerning to me is that we’ve now had AWS, Azure, and Cloudflare (Cloudflare twice) go down recently. My gut says:
1. Developers and IT are using LLMs in some part of the process, and those will not be 100% reliable.
2. The current culture of "I have (some personal activity or problem)" or "we don’t have staff, AI will replace me, f-this."
3. Pandemic aftereffects.
4. Political climate / war / drugs; all are intermingled.
Which makes it feel that much more special when a service provides open access to all of the infrastructure diagnostics, e.g. https://status.ppy.sh/
Not really, they're just lying. I mean, yes, of course they aren't oracles who discover complex problems at the instant of the first failure, but nah, they know well when there are problems and significantly underreport them, to the extent that they are less "smoke alarms" and more "your house has burned down and the ashes are still smoldering" alarms. Incidents are intentionally underreported. It's bad enough that there ought to be legislation and civil penalties for the large providers who fail to report known issues promptly.
If enough high-ranked customers report this feedback...
This seems like an issue with the design of your status page. If the broken dependencies truly had a limited blast radius, you should have been able to communicate that in your indicators and statistics. If not, then the unreliable reputation was deserved, and all you did by removing the status page was hide it.
True, but everyone that actually made the company work was much happier for it.
You are responsible for your dependencies, unless they are specific integrations. Either switch to more reliable dependencies or add redundancy so that you can switch between providers when one is down.
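
For what it's worth, the "switch between providers" part can be sketched in a few lines of Python. This is only a rough illustration, not anyone's actual setup: `call_with_failover`, `AllProvidersFailed`, `flaky_primary`, and `backup` are all made-up names, and a real system would also want timeouts, health checks, and some backoff.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed (hypothetical name)."""


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result."""
    errors: List[Exception] = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # any failure means: move on to the next one
            errors.append(exc)
    raise AllProvidersFailed(errors)


# Hypothetical stand-ins for two interchangeable upstream providers.
def flaky_primary() -> str:
    raise ConnectionError("primary provider is down")


def backup() -> str:
    return "sent via backup"


result = call_with_failover([flaky_primary, backup])
```

The order of the list is the preference order, so the backup is only hit when the primary raises. If every provider fails, the caller gets one exception carrying all the underlying errors, which is also the kind of signal you would want feeding a status page.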