I'd like to see a formal container security grade that works like:
1) Curate a list of all known (container) exploits
2) Run each exploit in environments of increasing security like permissions-based, jail, Docker and emulator
3) The percentage of prevented exploits would be the score from 0-100%
Under this scheme, I'd expect naive attempts at containerization with permissions and jails to score around 0%, while Docker might be above 50% and Microsandbox could potentially reach 100%. This might satisfy some of our intuition around questions like "why not just use a jail?". Also, the containers could run on a site on the open web as honeypots with cash or crypto prizes for pwning them to "prove" which containers achieve 100%.
We might also need to redefine what "secure" means, since exploits like Rowhammer and Spectre may make nearly all conventional and cloud computing insecure. Or maybe it's a moving target, like how 64-bit encryption might have once been considered secure but now we need 128-bit or higher.
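To make the scoring step concrete, here's a rough sketch of how the grade could be computed. Everything in it is a placeholder: `fake_exploit`, the four-item corpus and the environment names are made up for illustration, and the resulting numbers are not measurements of any real system.

```python
from typing import Callable, List, Set

# An "exploit" is modelled as a callable that takes an environment name and
# returns True if the escape succeeded. Real exploits would be far messier.
Exploit = Callable[[str], bool]


def security_grade(exploits: List[Exploit], environment: str) -> float:
    """Return the percentage (0-100) of exploits the environment prevented."""
    if not exploits:
        raise ValueError("need at least one exploit to grade against")
    prevented = sum(1 for run in exploits if not run(environment))
    return 100.0 * prevented / len(exploits)


def fake_exploit(vulnerable: Set[str]) -> Exploit:
    """Hypothetical stand-in: 'succeeds' only in the listed environments."""
    return lambda env: env in vulnerable


# Invented corpus, chosen only to exercise the arithmetic.
corpus = [
    fake_exploit({"permissions", "jail"}),
    fake_exploit({"permissions", "jail"}),
    fake_exploit({"permissions", "jail"}),
    fake_exploit({"permissions", "jail", "docker"}),
]

environments = ["permissions", "jail", "docker", "emulator"]
grades = {env: security_grade(corpus, env) for env in environments}
for env, grade in grades.items():
    print(f"{env:12s} {grade:5.1f}%")
```

The hard part is obviously the corpus, not the division: the grade only means something relative to the list of exploits you curated.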
Edit: the motivation behind this would be to find a container that's 100% secure without emulation, for performance and cost-savings benefits, as well as gaining insights into how to secure operating systems by containerizing their various services.
The only way to make Linux containers a meaningful sandbox is to drastically restrict the syscall API surface available to the sandboxee, which quickly reduces its value. It's no longer a "generic platform that you can throw any workload onto" but instead a bespoke thing that needs to be tuned and reconfigured for every use case.
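As an illustration of the "bespoke thing" problem, here's a minimal sketch that builds a default-deny seccomp profile in the JSON layout Docker accepts via `--security-opt seccomp=<file>`. The helper name `allowlist_profile` and the syscall list are my own assumptions; the list is a guess at what a simple static file server needs and is almost certainly incomplete for any real workload.

```python
import json
from typing import Dict, Iterable


def allowlist_profile(allowed_syscalls: Iterable[str]) -> Dict:
    """Build a default-deny seccomp profile that allows only the named syscalls."""
    return {
        # Anything not explicitly allowed fails with an errno.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Illustrative guess only; a different workload needs a different list.
static_file_server = allowlist_profile([
    "read", "write", "openat", "close", "fstat", "mmap", "munmap", "brk",
    "accept4", "epoll_wait", "epoll_ctl", "sendfile", "futex",
    "rt_sigreturn", "exit_group",
])

profile_json = json.dumps(static_file_server, indent=2)
print(profile_json)
```

Every new workload means another round of finding out which syscalls it really needs, which is exactly the per-use-case tuning described above.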
This is why you need virtualization. Until we have a properly hardened and memory safe OS, it's the only way. And if we do build such an OS it's unclear to me whether it will be faster than running MicroVMs on a Linux host.
The only meaningful difference is that Linux containers target partitioning Linux kernel services, which is a shared-by-default/default-allow environment that was never designed for and has never achieved meaningful security. The number of vulnerabilities resulting from "whoopsie, we forgot to partition shared service 123" would be hilarious if it were not a complete lapse of security engineering in a product people are convinced is adequate for security-critical applications.
Present a vulnerability assessment demonstrating that a team of 10 with 3 years' time (~$10-30M, comparable to many commercially motivated single-victim attacks these days) can find no vulnerabilities in your deployment, or a formal proof of security and correctness. Otherwise we should stick with the default assumption that software is easily hacked, rather than the extraordinary claim that demands extraordinary evidence.
You can create security boundaries around (and even within!) the VMM. You can make it so an escape into the VMM process has only minimal value, by sandboxing the VMM aggressively.
Plus you can absolutely escape the model of C++ emulating devices. Ideally I think VMMs should do almost nothing but manage VF passthroughs. Of course then we shift a lot of the problem onto the inevitably completely broken device firmware but again there are more ways to mitigate that than kernel bugs.
Intuitively there are differences. The Linux kernel is fucking huge, and anything that could bake the "shared resources" down to less than the entire kernel would be easier to verify, but that would also be true for an entirely software-based abstraction inside the kernel.
In a way it's the whole microkernel discussion again.
If you escape into a VMM you can do whatever the VMM can do. You can build a system where it cannot do very much more than the VM guest itself. By the time the guest boots, the process containing the vCPU threads has already lost all its interesting privileges and has no credentials of value.
Similarly with device passthrough. It's not very interesting if the device you're passing through ultimately has unchecked access to PCIe, but if you have a proper IOMMU set up it should be possible to have a system where pwning the device firmware is just a small step rather than an immediate escalation to root-equivalent. (I should say, I don't know if this system actually exists today, I just know it's possible).
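The "lost all its interesting privileges" part could look roughly like the sketch below: a process that starts as root does its privileged setup, then irreversibly switches to an unprivileged uid/gid before running anything guest-controlled. This covers only the credential-dropping piece and assumes Linux and a root start; `drop_privileges` is a hypothetical name, and a real VMM would layer seccomp filters, namespaces and so on on top.

```python
import os


def drop_privileges(uid: int, gid: int) -> None:
    """Irreversibly switch the current process to an unprivileged uid/gid.

    Must be called as root, after any privileged setup (opening device
    nodes, etc.) and before any guest-controlled code gets to run.
    """
    if uid == 0 or gid == 0:
        raise ValueError("target uid/gid must be unprivileged")

    # Order matters: groups first, then gid, then uid. Once the uid is
    # dropped we no longer have permission to change the others.
    os.setgroups([])
    os.setresgid(gid, gid, gid)
    os.setresuid(uid, uid, uid)

    # Sanity check: regaining root must now fail.
    try:
        os.setuid(0)
    except PermissionError:
        return
    raise RuntimeError("privileges were not dropped")
```

A caller would do something like `drop_privileges(65534, 65534)` right before starting the vCPU threads; 65534 is the conventional "nobody" id on many distros, but a dedicated per-VM uid would be a better choice.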
With a VMM escape your next step is usually to exploit the kernel. But if you sandbox the VMM properly there is very limited kernel attack surface available to it.
So yeah you're right it's similar to the microkernel discussion. You could develop these properties for a shared-kernel container runtime... By making it a microkernel.
It's just that that isn't a path with any next steps in the real world. The road from Docker to a secure VM platform is rich with reasonable incremental steps forward (virtualization is an essential step but it's still just one of many). The road from Docker to a microkernel is... Rewrite your entire platform and every workload!