Firecracker has a balloon device you can inflate (i.e., have it acquire as much memory inside the VM as possible) and then deflate, returning the memory to the host. You can do this while the VM is running.
https://github.com/firecracker-microvm/firecracker/blob/main...
EDIT: Okay, https://www.geekwire.com/2018/firecracker-amazon-web-service... says my "pretty sure" memory is in fact correct.
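To make that concrete: the balloon is driven through Firecracker's HTTP API on its unix socket, with a PUT to attach the device before boot and a PATCH to resize it at runtime. A rough Python sketch (the socket path and the 512 MiB target are made up; see the Firecracker docs for the full set of balloon options):

```python
import json
import socket

# Made-up path; it's whatever you passed to Firecracker via --api-sock.
API_SOCK = "/tmp/firecracker.socket"

def api_request(method, path, body):
    """Send one HTTP request over Firecracker's API unix socket (sketch, no error handling)."""
    payload = json.dumps(body)
    request = (
        f"{method} {path} HTTP/1.1\r\n"
        "Host: localhost\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
        f"{payload}"
    )
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(API_SOCK)
        s.sendall(request.encode())
        return s.recv(4096).decode()

# Before boot: attach a balloon device with a 0 MiB target.
print(api_request("PUT", "/balloon", {"amount_mib": 0, "deflate_on_oom": True}))

# At runtime: inflate the balloon to 512 MiB, pressuring the guest to hand
# pages back so the host can reclaim them; PATCH a smaller target to deflate.
print(api_request("PATCH", "/balloon", {"amount_mib": 512}))
```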
In really simple terms, so simple that I'm not 100% sure they are correct:
* KVM is a hypervisor, or rather it lets you turn Linux into a hypervisor [1], which will let you run VMs on your machine. I've heard KVM is rather hard to work with (steep learning curve). (Xen is also a hypervisor.)
* QEMU is a wrapper-of-a-sorts (a "machine emulator and virtualizer" [2]) which can be used on top of KVM (or Xen). "When used as a virtualizer, QEMU achieves near native performance by executing the guest code directly on the host CPU. QEMU supports virtualization when executing under the Xen hypervisor or using the KVM kernel module in Linux." [2]
* libvirt "is a toolkit to manage virtualization platforms" [3] and is used, e.g., by VDSM to communicate with QEMU.
* virt-manager is "a desktop user interface for managing virtual machines through libvirt" [4]. The screenshots on the project page should give an idea of what its typical use-case is - think VirtualBox and similar solutions.
* Proxmox is the above toolstack (-ish) but as one product.
---
[1] https://www.redhat.com/en/topics/virtualization/what-is-KVM
[2] https://www.qemu.org/
[3] https://libvirt.org/
[4] https://virt-manager.org/
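To make the layering a bit more concrete, here's a rough sketch with the libvirt Python bindings talking to the QEMU/KVM driver (the domain name "testvm" is made up; virt-manager is essentially doing these calls behind a GUI):

```python
import libvirt  # libvirt-python bindings

# Connect to the local QEMU/KVM driver (the same thing virt-manager talks to).
conn = libvirt.open("qemu:///system")

# List the domains (VMs) libvirt knows about and their state.
for dom in conn.listAllDomains():
    state, _ = dom.state()
    print(dom.name(), "running" if state == libvirt.VIR_DOMAIN_RUNNING else state)

# Look up a specific VM by name and start it if it's shut off.
# "testvm" is a made-up name for illustration.
dom = conn.lookupByName("testvm")
if not dom.isActive():
    dom.create()  # boots the VM; libvirt tells QEMU/KVM what to do

conn.close()
```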
I didn't know it existed until they posted, but QEMU has a Firecracker-inspired machine type:
> microvm is a machine type inspired by Firecracker and constructed after its machine model.
> It’s a minimalist machine type without PCI nor ACPI support, designed for short-lived guests. microvm also establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.
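For anyone who wants to try it, here's a rough sketch of booting a guest with the microvm machine type (kernel and rootfs paths are placeholders, and the exact flags vary a bit between QEMU versions):

```python
import subprocess

# Rough sketch: boot a minimal guest with QEMU's microvm machine type.
# vmlinux/rootfs paths are placeholders.
cmd = [
    "qemu-system-x86_64",
    "-M", "microvm",            # the Firecracker-inspired machine type
    "-enable-kvm", "-cpu", "host",
    "-m", "512m",
    "-nodefaults", "-no-user-config", "-nographic",
    "-kernel", "vmlinux",       # kernel booted directly, no BIOS/PCI
    "-append", "console=ttyS0 root=/dev/vda rw",
    "-serial", "stdio",
    "-drive", "id=root,file=rootfs.ext4,format=raw,if=none",
    "-device", "virtio-blk-device,drive=root",  # virtio-mmio, since there's no PCI
]
subprocess.run(cmd, check=True)
```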
It makes sure that if your VM and/or QEMU is broken out of, there are extra layers preventing access to the whole physical machine. For example, libvirt runs QEMU as a very limited user and, if you're using SELinux, the QEMU process can hardly read any file other than the VM image file.
By contrast, the method in the Arch wiki runs QEMU as root. QEMU is exposed to all sorts of untrusted input, so you really don't want it running as root.
Libvirt also handles cross machine operations such as live migration, and makes it easier to query a bunch of things from QEMU.
For more info see https://www.redhat.com/en/blog/all-you-need-know-about-kvm-u...
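As an example of the cross-machine bit, a live migration through the Python bindings looks roughly like this (host and domain names are made up):

```python
import libvirt

# Rough sketch of a live migration through libvirt; host names and the
# domain name are made up for illustration.
src = libvirt.open("qemu:///system")
dst = libvirt.open("qemu+ssh://other-host/system")

dom = src.lookupByName("testvm")

# Live-migrate the running guest to the other host; libvirt coordinates the
# two QEMU processes so the VM keeps running while its memory is copied over.
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)

dst.close()
src.close()
```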
See the OpenVZ docs: "Containers share dynamic libraries, which greatly saves memory." It's just one Linux kernel when you are running OpenVZ containers.
https://docs.openvz.org/openvz_users_guide.webhelp/_openvz_c...
See the KVM/KSM docs: "KSM enables the kernel to examine two or more already running programs and compare their memory. If any memory regions or pages are identical, KSM reduces multiple identical memory pages to a single page. This page is then marked copy on write."
https://access.redhat.com/documentation/en-us/red_hat_enterp...
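KSM is controlled through sysfs; here's a rough sketch of turning it on and checking how much it has merged (needs root, and it only helps if QEMU is marking guest RAM as mergeable, which it does by default):

```python
# Rough sketch: the KSM scanner merges identical pages that programs have
# marked MADV_MERGEABLE (QEMU/KVM does this for guest RAM) into single
# copy-on-write pages.

KSM = "/sys/kernel/mm/ksm"

def write(name, value):
    with open(f"{KSM}/{name}", "w") as f:
        f.write(str(value))

def read(name):
    with open(f"{KSM}/{name}") as f:
        return f.read().strip()

write("run", 1)               # start the KSM scanner thread
write("pages_to_scan", 1000)  # how many pages to scan per wake-up

# pages_shared / pages_sharing give a rough idea of how much duplicate
# guest memory has been collapsed into shared copy-on-write pages.
print("shared pages:", read("pages_shared"))
print("sharing (deduped) pages:", read("pages_sharing"))
```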
In KVM's defense, it supports a much wider range of OSes; OpenVZ only really does different versions of Linux, while KVM can run OpenBSD/FreeBSD/NetBSD/Windows and even OS/2 in addition to Linux.
Other than making sure we release unused memory to the host, we didn't customize QEMU that much. We do have a cool layered storage solution, though - basically a faster alternative to QCOW2 that's also VMM-independent. It's called overlaybd, and it was created at Alibaba. That will probably be another blog post. https://github.com/containerd/overlaybd
Zero-copy sharing is harder, as one system upgrade on one of the VMs will trash it, but KSM is overall pretty effective at saving some memory across similar VMs.
What actually can change is the amount of work that the kernel-mode hypervisor leaves to a less privileged (user space) component.
For more detail see https://www.spinics.net/lists/kvm/msg150882.html
> The distinction between these two types is not always clear. For instance, KVM and bhyve are kernel modules[6] that effectively convert the host operating system to a type-1 hypervisor.[7] At the same time, since Linux distributions and FreeBSD are still general-purpose operating systems, with applications competing with each other for VM resources, KVM and bhyve can also be categorized as type-2 hypervisors.[8]
We reclaim memory with a memory balloon device; for disk trimming we discard (and compress) the disk; and for I/O speed we use io_uring (which we only use for scratch disks, since the project disks are network disks).
It's a tradeoff. It's more work and does require custom implementations. For us that made sense, because in return we get a lightweight VMM that we can more easily extend with functionality like memory snapshotting and live VM cloning [1][2].
[1]: https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...
[2]: https://codesandbox.io/blog/cloning-microvms-using-userfault...
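For comparison, on a stock QEMU setup the balloon target can be changed at runtime over QMP - just a generic illustration, not necessarily how they do it (socket path and the 512 MiB target are made up):

```python
import json
import socket

# Made-up path; it's whatever you pointed -qmp unix:... at when starting QEMU.
QMP_SOCK = "/tmp/qmp.sock"

def qmp(sock, command, **arguments):
    """Send one QMP command and return the raw reply (sketch, no error handling)."""
    msg = {"execute": command}
    if arguments:
        msg["arguments"] = arguments
    sock.sendall((json.dumps(msg) + "\n").encode())
    return sock.recv(4096).decode()

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
    s.connect(QMP_SOCK)
    s.recv(4096)                   # greeting banner
    qmp(s, "qmp_capabilities")     # leave capabilities-negotiation mode
    # Shrink the guest to 512 MiB; the guest's balloon driver hands pages
    # back, and the host can reclaim them.
    print(qmp(s, "balloon", value=512 * 1024 * 1024))
```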
Fly.io themselves admitted they're oversubscribed, and AWS has been doing the same for years now.
My point is that these are largely appropriated terms - neither would fit the definitions of type 1 or type 2 from the early days when Popek and Goldberg were writing about them.
> Or does the thing at the bottom have to "play nice" with random other processes?
From this perspective, Xen doesn't count. You can have all sorts of issues from the dom0 side and from competing for resources - you mention PV drivers later, and you can 100% run into issues with VMs because of how dom0 schedules blkback and netback when competing with other processes.
ESXi can also run plenty of unmodified Linux binaries - go back in time 15 years and it was basically a fully featured OS. There's a lot running on it, too. Meanwhile, you can build a Linux kernel with plenty of things switched off and a root filesystem with just the bare essentials for managing KVM and QEMU that is even less useful for general-purpose computing than ESXi.
> Er, both KVM and Xen try to switch to paravirtualized interfaces as fast as possible, to minimize the emulation that QEMU has to do.
There are more things being emulated than there are PV drivers for, but this is a bit outside of my point.
For KVM, the vast majority of implementations use QEMU to manage their virtio devices as well (https://developer.ibm.com/articles/l-virtio/); you'll notice that IBM even discusses these paravirtual drivers directly in the context of "emulating" the device. Perhaps a better way to get the intent across here would be to say that QEMU handles the device model.
From a performance perspective, ideally you'd want to avoid PV here too and go with SR-IOV devices or passthrough.
If you want data colocated on the same filesystem, then put it on the same filesystem. VMs suck; nobody spins up a whole fabricated IBM-compatible PC and gaslights their executable because they want to.[1] They do it because their OS (a) doesn't have containers, (b) doesn't provide strong enough isolation between containers, or (c) the host kernel can't run their workload (different ISA, different syscalls, different executable format, etc.).
Anyone who has ever tried to run heavyweight VMs atop a snapshotting volume already knows the idea of "shared blocks" is a fantasy; as soon as you do one large update inside the guest, the delta between your volume clones and the base snapshot grows immensely. That's why Docker et al. have a concept of layers and you describe your desired state as a series of idempotent instructions applied to those layers. That's possible because Docker operates semantically on a filesystem; it's much harder to do at the level of a block device.
Is a block containing b"hello, world" part of a program's text section, or part of a user's document? You don't know, because the guest is asking you for an LBA, not a path, not modes, not an ACL, etc. If you don't know that, the host kernel has no idea how the page should be mapped into memory. Furthermore, storing the information to dedup common blocks is non-trivial: go look at the manpage for ZFS' deduplication and it is littered w/ warnings about the performance, memory, and storage implications of dealing with the dedup table.
Straight from their site. QEMU is the user space interface, KVM the kernel space driver. It’s enough to run whatever OS. That’s the point.
For libvirt: https://libvirt.org/drivers.html
They support a bunch as well.
E.g. using the firecracker jailer: https://github.com/firecracker-microvm/firecracker/blob/main...
https://learn.microsoft.com/en-us/windows/security/applicati...
Interesting. I guess we are reading a different website.
Here's a post of someone using KVM from Python (raw, without needing a kvm library or anything): https://www.devever.net/~hl/kvm
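In the same spirit, here's a minimal sketch of what that looks like: open /dev/kvm and issue raw ioctls, with the request numbers taken from linux/kvm.h (no wrapper library; needs access to /dev/kvm):

```python
import fcntl
import os

# ioctl numbers from linux/kvm.h, where KVMIO is 0xAE.
KVM_GET_API_VERSION = 0xAE00  # _IO(KVMIO, 0x00)
KVM_CREATE_VM       = 0xAE01  # _IO(KVMIO, 0x01)

kvm_fd = os.open("/dev/kvm", os.O_RDWR)

# Should print 12 -- the stable KVM API version.
print("KVM API version:", fcntl.ioctl(kvm_fd, KVM_GET_API_VERSION))

# Returns a new file descriptor representing an (empty) virtual machine;
# memory regions and vCPUs would be added with further ioctls on it.
vm_fd = fcntl.ioctl(kvm_fd, KVM_CREATE_VM, 0)
print("created VM fd:", vm_fd)

os.close(vm_fd)
os.close(kvm_fd)
```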