Comparing your fintech experience with the difficulty of carrying the world’s internet traffic across hundreds of customer products is like saying, “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds.”
This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective, and it's starting to look like a liability for large numbers of people, there are obvious solutions for that.
If this were purely a money problem it would have been solved ages ago. It’s a difficult problem to solve. Also, they’re the youngest of the major cloud providers and have a fraction of the resources that Google, Amazon, and Microsoft have.
The fact that no major cloud provider is actually good is not an argument that Cloudflare isn't bad, or even that they couldn't/shouldn't do better than they are. They have fewer resources than Google or Microsoft, but they're also in a unique position that makes us differently vulnerable when they fuck up. It's not all their fault, since it was a mistake to centralize the internet to the extent that we have in the first place, but now that they are responsible for so much, they have to expect that people will be upset when they fail.
Honestly we shouldn't have created a system where any single company's failure is able to impact such a huge percentage of the network. The internet was designed for resilience and we abandoned that ideal to put our trust in a single company that maybe isn't up for the job. Maybe no one company ever could do it well enough, but I suspect that no single company should carry that responsibility in the first place.
If there’s indeed a 5-minute lag in Cloudflare's monitoring dashboards, I honestly think that's a pretty big concern.
For example, a simple curl script that hits your top 100 customers' homepages every 30 seconds would have raised a warning and sent notifications within a minute. If you stagger deployments at 5-minute intervals, you could have identified the issue and initiated the rollback within 2 minutes and completed it within 3 minutes.
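To make the idea concrete, here is a rough sketch of that kind of probe in Python rather than curl. Everything in it is an assumption for illustration: the function names, the 20% failure threshold, and the "two bad rounds in a row" rule are made up, and it says nothing about how Cloudflare's real monitoring works. With a 30-second probe interval, two consecutive bad rounds would trip the alarm roughly a minute after a bad deploy starts failing.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return 0


def failure_rate(urls: Iterable[str],
                 fetch: Callable[[str], int] = fetch_status) -> float:
    """Fraction of urls that did not return a 2xx/3xx status in one probe round."""
    urls = list(urls)
    if not urls:
        return 0.0
    failed = sum(1 for u in urls if not 200 <= fetch(u) < 400)
    return failed / len(urls)


def should_roll_back(rates: Iterable[float],
                     threshold: float = 0.2,
                     consecutive: int = 2) -> bool:
    """True when the last `consecutive` probe rounds all met the (hypothetical) threshold."""
    recent = list(rates)[-consecutive:]
    return len(recent) == consecutive and all(r >= threshold for r in recent)
```

In use, you would call failure_rate on the homepage list every 30 seconds, append the result to a history list, and page someone (or halt the staged rollout) as soon as should_roll_back(history) returns True. Requiring two rounds trades a little detection time for fewer false alarms from a single flaky probe.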
Could Cloudflare do better? Sure, that’s a truism for everyone. Did they make mistakes and continue to make mistakes? Also a truism.
Trust me, they are acutely aware of people getting upset when they fail. Why do you think their CEO and CTO are writing these blog posts?