At scale there's no such thing as "instant". There is distribution of progress over time.
The failure is an event. Collection of events takes times (at scale, going through store and forward layers). Your "monitorable failure rate" is over an interval. You must measure for that interval. And then you are going to emit another event.
Global config systems are a tradeoff. They're not inherently bad; they have both strengths and weaknesses. Really bad: non-zero possibility for system collapse. Bad: Can progress very quickly to towards global outages. Good: Faults are detected quickly, response decision making is easy, and mitigation is fast.
Hyperscale is not just "a very large number of small simple systems".
Denoising alerts is a fact of life for SRE...and is a survival skill.