1. bjackm+(OP) 2025-05-31 08:10:25
I see your point, but even if your VMM is a zillion lines of C++ with emulated devices, there are opportunities to secure it that don't exist with a shared-monolithic-kernel container runtime.

You can create security boundaries around (and even within!) the VMM. You can make it so an escape into the VMM process has only minimal value, by sandboxing the VMM aggressively.

Plus you can absolutely escape the model of emulating devices in C++. Ideally I think VMMs should do almost nothing but manage VF passthroughs. Of course, then we shift a lot of the problem onto the inevitably completely broken device firmware, but again, there are more ways to mitigate that than there are to mitigate kernel bugs.

replies(1): >>delusi+86
2. delusi+86 2025-05-31 09:51:23
>>bjackm+(OP)
Could you elaborate on how you could secure those architectures better? It's unclear to me how being in device firmware or being a VMM provides you with any further abilities. Surely you still have the same fundamental problem of being a shared resource.

Intuitively there are differences. The Linux kernel is fucking huge, and anything that could boil the "shared resources" down to less than the entire kernel would be easier to verify, but that would also be true for an entirely software-based abstraction inside the kernel.

In a way it's the whole microkernel discussion again.

replies(1): >>bjackm+r7
3. bjackm+r7 2025-05-31 10:14:49
>>delusi+86
When you escape a container, you can generally do whatever the kernel can do. There is no further security boundary.

If you escape into a VMM, you can do whatever the VMM can do. You can build a system where it cannot do very much more than the VM guest itself. By the time the guest boots, the process containing the vCPU threads has already lost all its interesting privileges and has no credentials of value.

Similarly with device passthrough. It's not very interesting if the device you're passing through ultimately has unchecked access to PCIe, but if you have a proper IOMMU set up, it should be possible to have a system where pwning the device firmware is just a small step rather than an immediate escalation to root-equivalent. (I should say, I don't know if this system actually exists today, I just know it's possible.)
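
To make the IOMMU point concrete, here is a toy model in Python. It is only a sketch of the idea, not how any real IOMMU or driver is programmed: the `ToyIommu` class, the device names `vf0`/`vf1`, and the addresses are all made up for illustration. The point it tries to show is that a device can only DMA to pages that were explicitly mapped for it, so compromised firmware on that device is confined to those pages.

```python
class IommuFault(Exception):
    """Raised when a device touches memory outside its own mappings."""


class ToyIommu:
    """Hypothetical, simplified IOMMU: one page table per device."""

    PAGE = 4096

    def __init__(self):
        # device id -> {I/O virtual page number -> host physical page number}
        self.domains = {}

    def map(self, device, iova, phys):
        """Allow `device` to reach the host page at `phys` via address `iova`."""
        self.domains.setdefault(device, {})[iova // self.PAGE] = phys // self.PAGE

    def translate(self, device, iova):
        """Translate a DMA address, or fault if the page was never mapped."""
        table = self.domains.get(device, {})
        page = iova // self.PAGE
        if page not in table:
            raise IommuFault(f"{device}: DMA to unmapped address {iova:#x}")
        return table[page] * self.PAGE + iova % self.PAGE


iommu = ToyIommu()
# The host maps exactly one page for the passed-through function.
iommu.map("vf0", iova=0x1000, phys=0x7F000)
```

In this model, `iommu.translate("vf0", 0x1010)` lands inside the one mapped page, while any other address (or any access from `vf1`, which has no mappings) raises `IommuFault` instead of reaching host memory.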

With a VMM escape, your next step is usually to exploit the kernel. But if you sandbox the VMM properly, there is very limited kernel attack surface available to it.
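
A rough way to picture "limited kernel attack surface", again as a toy Python model and not a real seccomp filter: the syscalls an attacker can reach after taking over a process are whatever the kernel offers, intersected with whatever the sandbox lets through. The syscall names below are real Linux syscalls, but both sets are tiny illustrative subsets and the allowlist is an assumption, not a recommended policy.

```python
# Toy model only: both sets are illustrative, not a real policy.
KERNEL_SYSCALLS = frozenset({
    "read", "write", "ioctl", "mmap", "futex", "epoll_wait", "exit_group",
    "openat", "socket", "mount", "ptrace", "bpf", "keyctl", "unshare", "clone",
})

# Hypothetical allowlist for a VMM process once its setup phase is over.
VMM_ALLOWLIST = frozenset({
    "read", "write", "ioctl", "mmap", "futex", "epoll_wait", "exit_group",
})


def reachable(kernel, allowlist=None):
    """Syscalls an attacker can invoke after taking over the process.

    allowlist=None models a process running without any syscall filter.
    """
    return kernel if allowlist is None else kernel & allowlist


unfiltered = reachable(KERNEL_SYSCALLS)
sandboxed = reachable(KERNEL_SYSCALLS, VMM_ALLOWLIST)
# Kernel entry points the attacker loses because of the sandbox.
blocked = sorted(unfiltered - sandboxed)
```

Here `blocked` contains things like `mount`, `ptrace`, and `bpf`: the attacker who lands in the sandboxed VMM has to find a kernel bug behind the handful of calls left in `sandboxed`, rather than behind the whole syscall table.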

So yeah you're right it's similar to the microkernel discussion. You could develop these properties for a shared-kernel container runtime... By making it a microkernel.

It's just that isn't a path with any next steps in the real world. The road from Docker to a secure VM platform is rich with reasonable incremental steps forward (virtualization is an essential step but it's still just one of many). The road from Docker to a microkernel is... Rewrite your entire platform and every workload!

replies(1): >>delusi+py
4. delusi+py 2025-05-31 15:31:54
>>bjackm+r7
> It's just that isn't a path with any next steps in the real world.

It appears we find ourselves at the Theory/Praxis intersection once again.

> The road from Docker to a secure VM platform is rich with reasonable incremental steps forward

The reason it seems so reasonable is that it's well trodden. There were an infinity of VM platforms before Docker, and they were all discarded for pretty well-known engineering reasons, mostly to do with performance, but also for being difficult for developers to reason about. I have no doubt that there's still dialogue worth having between those two approaches, but cgroups isn't a "failed" VM security boundary any more than Linux is a failed microkernel. It never aimed to be a VM-like security boundary.
