I've worked at one of the top fintech firms, whenever we do a config change or deployment, we are supposed to have rollback plan ready and monitor key dashboards for 15-30 minutes.
The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.
I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.
For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
I'm more talking about how slow it was to detect the issue caused by the config change, and perform the rollback of the config change. It took 20 minutes.