zlacker

1. everfr+(OP)[view] [source] 2026-02-03 01:53:41
How is Azure still having faults that affect multiple regions? Clearly their region definition is bollocks.
replies(1): >>ragall+Ai
2. ragall+Ai[view] [source] 2026-02-03 04:20:35
>>everfr+(OP)
All 3 hyperscalers have vulnerabilities in their control planes: they're either a single point of failure, like AWS with us-east-1, or global, meaning that a faulty release can take them down entirely. They all take AZ resilience to mean that existing compute will continue to work as before, but allocation of new resources might fail in multi-AZ or multi-region ways.

It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure.
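
As a rough illustration of what "statically allocate with enough slack" could look like in numbers, here is a minimal sketch. It assumes a service spread evenly across zones that must carry peak load after losing some zones, without any new instances being launched. The function name, parameters, and the 20% default headroom are hypothetical, not anything a cloud provider prescribes.

```python
def static_capacity(peak_rps: int, rps_per_instance: int, zones: int,
                    zones_lost: int = 1, headroom_pct: int = 20) -> dict:
    """Size a fixed fleet so peak load fits without auto scaling.

    All inputs are illustrative assumptions: peak_rps is the worst
    expected load, rps_per_instance is measured per-instance throughput,
    and headroom_pct is extra slack on top of the peak.
    """
    if zones <= zones_lost:
        raise ValueError("need at least one surviving zone")
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")

    # Instances needed for peak plus headroom (integer ceiling division).
    needed = -(-(peak_rps * (100 + headroom_pct)) // (100 * rps_per_instance))

    # The surviving zones alone must be able to hold all needed instances.
    surviving = zones - zones_lost
    per_zone = -(-needed // surviving)

    # Every zone is provisioned identically, since any of them may be lost.
    return {"needed": needed, "per_zone": per_zone, "total": per_zone * zones}


if __name__ == "__main__":
    print(static_capacity(peak_rps=9000, rps_per_instance=100, zones=3))
```

In this sketch the gap between "total" and "needed" is capacity that sits idle in normal operation, which is the price of not depending on the control plane during an outage.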

replies(2): >>tbrown+Qj >>everfr+Zl
◧◩
3. tbrown+Qj[view] [source] [discussion] 2026-02-03 04:30:38
>>ragall+Ai
> It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure.

That sounds oddly similar to owning hardware.

replies(1): >>ragall+Dn
◧◩
4. everfr+Zl[view] [source] [discussion] 2026-02-03 04:50:18
>>ragall+Ai
This outage report describes what appears to be a VM control plane failure (it mentions stop operations not working) across multiple regions.

AWS has never had this type of outage in 20 years. Yet Azure has them constantly.

This is a total failure of engineering and has nothing to do with capacity. Azure is a joke of a cloud.

replies(2): >>mirash+in >>ragall+sn
◧◩◪
5. mirash+in[view] [source] [discussion] 2026-02-03 05:02:54
>>everfr+Zl
AWS had an outage that blocked all EC2 operations just a few months ago: https://aws.amazon.com/message/101925/
replies(2): >>everfr+lu >>jamesf+xI3
◧◩◪
6. ragall+sn[view] [source] [discussion] 2026-02-03 05:04:49
>>everfr+Zl
I do agree that Azure seems to be a lot worse: its control plane(s) seem to be much more centralized than those of the other two.
◧◩◪
7. ragall+Dn[view] [source] [discussion] 2026-02-03 05:06:50
>>tbrown+Qj
In a way. It means that you can get new capacity most of the time, but the transition windows where a service gets resized (or mutated in general) have to be minimised and carefully controlled by ops.
◧◩◪◨
8. everfr+lu[view] [source] [discussion] 2026-02-03 06:11:17
>>mirash+in
This was the largest AWS outage in a long, long time and was still constrained to a single AWS region.

Which is my point.

The same fault on Azure would be a global (all-regions) fault.

◧◩◪◨
9. jamesf+xI3[view] [source] [discussion] 2026-02-04 00:28:15
>>mirash+in
Yeah I remember one maybe four years ago? Existing workloads were fine but I had to go and tell my marketing department to not do anything until it was sorted because auto-scaling was busted.