I've worked at one of the top fintech firms, whenever we do a config change or deployment, we are supposed to have rollback plan ready and monitor key dashboards for 15-30 minutes.
The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.
I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.
For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
Comparing the difficulty of running the world’s internet traffic with hundreds of customer products with your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.
This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective and it's starting to look like a liability for large numbers of people there are obvious solutions for that.
Honestly we shouldn't have created a system where any single company's failure is able to impact such a huge percentage of the network. The internet was designed for resilience and we abandoned that ideal to put our trust in a single company that maybe isn't up for the job. Maybe no one company ever could do it well enough, but I suspect that no single company should carry that responsibility in the first place.