zlacker

[return to "GitHub experience various partial-outages/degradations"]
1. llama0+t8[view] [source] 2026-02-02 22:02:47
>>bhoust+(OP)
Looks like Azure as a platform just killed the ability to perform VM scale operations, due to a change to the ACL on a storage account that hosted VM extensions. Wow... We noticed when GitHub Actions went down, then our self-hosted runners followed, because we can't scale anymore.

Information

Active - Virtual Machines and dependent services - Service management issues in multiple regions

Impact statement: As early as 19:46 UTC on 2 February 2026, we are aware of an ongoing issue causing customers to receive error notifications when performing service management operations - such as create, delete, update, scaling, start, stop - for Virtual Machines (VMs) across multiple regions. These issues are also causing impact to services with dependencies on these service management operations - including Azure Arc Enabled Servers, Azure Batch, Azure DevOps, Azure Load Testing, and GitHub. For details on the latter, please see https://www.githubstatus.com.

Current status: We have determined that these issues were caused by a recent configuration change that affected public access to certain Microsoft‑managed storage accounts, used to host extension packages. We are actively working on mitigation, including updating configuration to restore relevant access permissions. We have applied this update in one region so far, and are assessing the extent to which this mitigates customer issues. Our next update will be provided by 22:30 UTC, approximately 60 minutes from now.

https://azure.status.microsoft/en-us/status

2. bob102+vq[view] [source] 2026-02-02 23:07:01
>>llama0+t8
They've always been terrible at VM ops. I never get these weird quota limits and errors anywhere else. It's almost as if Amazon wants me to be a customer and Microsoft does not.
3. everfr+BT[view] [source] 2026-02-03 01:53:41
>>bob102+vq
How is Azure still having faults that affect multiple regions? Clearly their region definition is bollocks.
4. ragall+bc1[view] [source] 2026-02-03 04:20:35
>>everfr+BT
All 3 hyperscalers have vulnerabilities in their control planes: the control plane is either a single point of failure, like AWS with us-east-1, or global, meaning that a faulty release can take it down entirely. So treat AZ resilience as meaning that existing compute will continue to work as before, but allocation of new resources might fail in multi-AZ or multi-region ways.

It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure.
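
For concreteness, here is a minimal capacity-planning sketch of that "statically allocate with slack" idea. The throughput figures, headroom factor, and AZ count are invented for illustration, and the helper function is hypothetical, not anything a provider ships:

```python
# Hypothetical sketch: size a statically allocated fleet so it absorbs peak load
# plus a safety margin without ever asking the control plane for more instances.
# All numbers below are illustrative assumptions, not measurements.

import math

PEAK_RPS = 12_000          # assumed peak requests/sec for the service
RPS_PER_INSTANCE = 400     # assumed sustainable throughput of one instance
HEADROOM = 0.30            # extra slack so a spike doesn't force a scale-up
AZ_COUNT = 3               # instances are spread across availability zones

def static_fleet_size(peak_rps: float, rps_per_instance: float,
                      headroom: float, az_count: int) -> int:
    """Instances to pre-allocate so losing one AZ still leaves enough capacity."""
    needed = peak_rps * (1 + headroom) / rps_per_instance
    # Survive the loss of one AZ's worth of instances without scaling up:
    per_surviving_az = math.ceil(needed / (az_count - 1))
    return per_surviving_az * az_count

print(static_fleet_size(PEAK_RPS, RPS_PER_INSTANCE, HEADROOM, AZ_COUNT))
# -> 60 instances total, vs. 30 if you sized for peak only and trusted
#    autoscaling (i.e. the control plane) to add the rest on demand.
```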

5. tbrown+rd1[view] [source] 2026-02-03 04:30:38
>>ragall+bc1
> It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure.

That sounds oddly similar to owning hardware.

6. ragall+eh1[view] [source] 2026-02-03 05:06:50
>>tbrown+rd1
In a way. It means that you can still get new capacity most of the time, but the transition windows where a service gets resized (or mutated in general) have to be minimised and carefully controlled by ops.