Is that the case, though? My understanding was that even if I run a Docker container as root and the container is 100% compromised, there would still need to be a vulnerability in Docker for it to “attack” the host. Or am I missing something?
Also, if you've been compromised, you may have a rootkit that hides itself from the filesystem, so you can't be sure of a file's existence through a simple `ls` or `stat`.
The core of the problem here is that process isolation doesn't save you from whole classes of attack vectors or misconfigurations that open you up to nasty surprises. Docker is great, just don't think of it as a sandbox to run untrusted code.
Thanks for mentioning it - but now... how does one deal with this?
Second, even if your Docker container is configured properly, the attacker gets to call themselves root and talk to the kernel. It's a security boundary, sure, but it's not as battle-tested as the isolation of not being root, or the isolation between VMs.
Thirdly, in the stock configuration processes inside a docker container can use loads of RAM (causing random things to get swapped to disk or OOM killed), can consume lots of CPU, and can fill your disk up. If you consider denial-of-service an attack, there you are.
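To make the resource point concrete, here is a minimal sketch of the caps you could add to `docker run`. The helper name `limited_run_argv` and the default values are my own placeholders, not recommendations. It only builds the argument list, and it does not cover disk usage, which depends on the storage driver.

```python
def limited_run_argv(image, memory="512m", cpus="1.0", pids=256):
    """Build a `docker run` argv with explicit resource caps.

    The default numbers are illustrative assumptions only.
    """
    return [
        "docker", "run", "--rm",
        "--memory", memory,         # hard RAM cap for the container
        "--memory-swap", memory,    # equal to --memory, so no additional swap
        "--cpus", cpus,             # CPU cap, expressed in cores
        "--pids-limit", str(pids),  # cap on process count (fork bombs)
        image,
    ]
```

Pass the result to `subprocess.run` to launch the container, or join it with `shlex.join` to get a command line you can paste.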
Fourthly, there are a bunch of settings that disable the security boundary, and a lot of guides online will tell you to use them. Doing something in Docker that needs to access hot-plugged webcams? Hmm, it's not working unless I set --privileged - oops, there goes the security boundary. Trying to attach a debugger while developing and you set CAP_SYS_PTRACE? Bypasses the security boundary. Things like that.
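If you want to catch these before they land in a script, a crude check like the following might help. It is only a sketch. `risky_options` and the `RISKY_CAPS` set are names I made up, and `SYS_ADMIN` and `ALL` are my additions to the two examples above. It also does not tell where the Docker options end and the container's own command begins.

```python
RISKY_CAPS = {"SYS_PTRACE", "SYS_ADMIN", "ALL"}  # assumption: a starter list, not exhaustive

def normalize(argv):
    """Split `--flag=value` into `--flag`, `value` so both spellings match."""
    out = []
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            flag, value = arg.split("=", 1)
            out += [flag, value]
        else:
            out.append(arg)
    return out

def risky_options(argv):
    """Return the options in a `docker run` argv that weaken the boundary."""
    args = normalize(argv)
    findings = []
    for i, arg in enumerate(args):
        following = args[i + 1] if i + 1 < len(args) else None
        if arg == "--privileged" and following != "false":
            findings.append("--privileged")
        elif arg == "--cap-add" and following is not None:
            cap = following.upper()
            if cap.startswith("CAP_"):
                cap = cap[4:]  # Docker accepts the name with or without CAP_
            if cap in RISKY_CAPS:
                findings.append(f"--cap-add {cap}")
    return findings
```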
* but if you’re used to bind-mounting, they’ll be a hassle
Edit: This is by no means comprehensive, but I feel compelled to point it out specifically for some reason: remember not to mount .git writable, folks! Write access to .git is arbitrary code execution as whoever runs git.
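One of the mechanisms behind the .git point is hooks: executable files in `.git/hooks` get run by git on the host the next time someone commits, checks out, and so on. As a rough sketch (the function name `active_git_hooks` is mine), this lists hooks that would actually fire. It ignores `core.hooksPath` and the other config-based ways to get code run, so an empty result does not mean the repo is clean.

```python
import os
from pathlib import Path

def active_git_hooks(repo):
    """List executable, non-sample files in <repo>/.git/hooks."""
    hooks_dir = Path(repo) / ".git" / "hooks"
    if not hooks_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in hooks_dir.iterdir()
        if p.is_file()
        and not p.name.endswith(".sample")  # git ships inert *.sample files
        and os.access(p, os.X_OK)           # git only runs executable hooks
    )
```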
Imagine naming this executable "ls" or "echo" and someone having "." in their path (which is why you shouldn't): as soon as you run "ls" in this directory, you've run compromised code.
There are obviously other ways to get that executable to run on the host; this is just a simple example.
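For the "." in PATH case specifically, a quick self-check is easy to write. This is a sketch with a made-up function name (`unsafe_path_entries`). It treats an empty entry the same as ".", since POSIX shells read an empty PATH element as the current directory, and it flags any other relative entry too.

```python
import os

def unsafe_path_entries(path_value):
    """Return PATH entries that resolve relative to the current directory."""
    unsafe = []
    for entry in path_value.split(os.pathsep):
        # An empty entry means "current directory" to POSIX shells, like ".".
        if entry == "" or not os.path.isabs(entry):
            unsafe.append(entry or "<empty>")
    return unsafe
```

Calling `unsafe_path_entries(os.environ.get("PATH", ""))` on the host shows whether a planted "ls" in a shared directory could shadow the real one.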
Docker is pretty much the same but supposedly more flimsy.
Both have non-obvious configuration weaknesses that can lead to escapes.
You might still want to tighten things up. Just adding on the "rootless" part - running the container runtime as an unprivileged user on the host instead of root - you also want to run npm/node as unprivileged user inside the container. I still see many defaulting to running as root inside the container since that's the default of most images. OP touches on this.
For rootless podman, this will run as a user with your current uid and map ownership of mounts/volumes:
podman run -u$(id -u) --userns=keep-id
but will not stop serious malware
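On the "don't run as root inside the container" point, one cheap guard is to have the entrypoint refuse to start as uid 0. A minimal sketch follows; `refuse_root` is a name I invented. It only looks at the uid as seen inside the container, so it says nothing about how that uid maps to the host.

```python
import os

def refuse_root(euid=None):
    """Return an error message if the effective uid is 0, else None."""
    if euid is None:
        euid = os.geteuid()  # uid as seen from inside the container
    if euid == 0:
        return "refusing to start: running as uid 0 inside the container"
    return None
```

An entrypoint could call `refuse_root()` first and pass a non-None result to `sys.exit` before doing anything else.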
OTOH if I had written such a script for Linux I'd be looking to grab the contents of $(hist) $(env) $(cat /etc/{group,passwd})... then enumerate /usr/bin/ /usr/local/bin/ and the XDG_{CACHE,CONFIG} dirs - some plaintext credentials are usually here.
The $HOME/.{aws,docker,claude,ssh}
Basically the attacker just needs to know their way around your OS. The script enumerating these directories is the 0777 script they were able to write from inside the root access container.
Are you holding millions of dollars in crypto/sensitive data? Better assume the machine and data is compromised and plan accordingly.
Is this your toy server for some low-value things where nothing bad can happen besides a bit of embarrassment even if you do get hit by a container escape zero-day? You're probably fine.
This attack is just a large-scale automated attack designed to mine cryptocurrency; it's unlikely any human ever actually logged into your server. So cleaning up the container is most likely fine.
Go and Rust tend to lend themselves to these more restrictive environments a bit better than other options.
Not necessarily a vulnerability per se. A bridged adapter, for example, lets you do a lot. A few years ago there was a story about someone who got root in a container and, because the container used a bridged adapter, was able to intercept traffic carrying account info updates on GCP.
Honestly, citation needed. Very rare unless you're literally giving the container access to write to /usr/bin or other binaries the host is running, to reconfigure your entire /etc, access to sockets like Docker's, or some other insane level of overreach I doubt even the least educated Docker user would do.
While of course they should be scoped properly, people act like some elusive 0-day container escape will get used on their Minecraft server or personal blog that has otherwise sane mounts, non-admin capabilities, etc. You aren't that special.
Attacker now needs a Docker exploit and then a VM exploit before getting to the hypervisor (and, no, pwning the VM ain't the same as pwning the hypervisor).
I disagree with other commenters here that Docker is not a security boundary. It's a fine one, as long as you don't disable the boundary, which is as easy as running a container with `--privileged`. I wrote about secure alternatives for devcontainers here: https://cgamesplay.com/recipes/devcontainers/#docker-in-devc...
Unfortunately, user namespaces are still not the default configuration with Docker (even though the core issues that made using them painful have long since been resolved).
And a shocking number of tutorials recommend bind-mounting docker.sock into the container without any warning (some even tell you to mount it "ro" -- which is even funnier since that does nothing). I have a HN comment from ~8 years ago complaining about this.
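To illustrate why "ro" should not reassure anyone: as far as I understand it, a read-only bind mount does not stop a process from connecting to a Unix socket and sending API requests over it. So a check for this should flag the socket no matter what the mount options say. A rough sketch follows, with hypothetical names (`socket_mounts`, `SOCKET_PATHS`). It only handles the short `src:dst[:opts]` form of `-v`.

```python
import os

# Assumption: the two usual locations of the Docker control socket.
SOCKET_PATHS = {"/var/run/docker.sock", "/run/docker.sock"}

def socket_mounts(volume_specs):
    """Return (spec, marked_read_only) for every spec that mounts the Docker socket."""
    flagged = []
    for spec in volume_specs:
        parts = spec.split(":")
        source = os.path.normpath(parts[0])
        options = parts[2].split(",") if len(parts) > 2 else []
        if source in SOCKET_PATHS:
            # Flagged even when "ro" is present: the option is reported, not trusted.
            flagged.append((spec, "ro" in options))
    return flagged
```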
Not only does it allow me to partition the host for workloads, but I also get security boundaries. While it may be a slight performance hit, the segmentation also makes more logical sense in the way I view the workloads. Finally, it's trivial to template and script, so it's very low maintenance and allows me to kill an LXC and just reprovision it if I need to make any significant changes. And I never need to migrate any data in this model (or very rarely).
The only serious company that I'm aware of which doesn't understand that is Microsoft, and the reason I know that is because they've been embarrassed again and again by vulnerabilities that only exist because they run multitenant systems with only containers for isolation.
Of course if you have a kernel exploit you'd be able to break out (this is what gvisor mitigates to some extent), and nothing seems to really protect against rowhammer/memory timing style attacks (but they don't seem to be commonly used). Beyond this, the main misconfigurations seem to be overly broad volume bindings (e.g. something that allows access to the docker control socket from inside the container, or an obviously stupid mount like mounting your root inside the container).
Am I missing something?
It's all turtles, all the way down.
But for a typical case, VMs are the bare minimum to say you have a _secure_ isolation boundary because the attack surface is way smaller.
https://support.broadcom.com/web/ecx/support-content-notific...
https://nvd.nist.gov/vuln/detail/CVE-2019-5183
https://nvd.nist.gov/vuln/detail/CVE-2018-12130
https://nvd.nist.gov/vuln/detail/CVE-2018-2698
https://nvd.nist.gov/vuln/detail/CVE-2017-4936
In the end you need to configure it properly and pray there are no escape vulnerabilities. That's the same standard you applied to containers to say they're definitely never a security boundary. Seems like you're drawing some pretty arbitrary lines here.
While I generally agree with the technical argument, I fail to see the threat model here. Is it that some external threat would have prior knowledge that an important target is in close proximity to a less hardened one? It doesn't seem viable to me for nation states to spend the expensive R&D to compromise hobbyist-adjacent services in the hope that they can discover more valuable data on the host hypervisor.
Once such expensive malware is deployed, there's a huge risk that all the R&D money is spent on potentially just reconnaissance.