zlacker

We replaced Firecracker with QEMU

submitted by hugodu+(OP) on 2023-07-10 14:15:04 | 409 points 143 comments
[view article] [source]

NOTE: showing posts with links only
9. amarsh+96[view] [source] 2023-07-10 14:41:46
>>hugodu+(OP)
No mention of Cloud Hypervisor [1]…perhaps they don’t know about it? It’s based in part on Firecracker and supports free page reporting, virtio-blk-pci, PCI passthrough, and (I believe) discard in virtio-blk.

[1]: https://www.cloudhypervisor.org/

11. hugodu+b7[view] [source] [discussion] 2023-07-10 14:45:13
>>amarsh+96
We do, and we'd love to use it in the future. We've found that it's not yet ready for prime time and is missing some features; the biggest problem is that it doesn't support discard operations. Here's a short write-up we did about the VMMs we considered: https://github.com/hocus-dev/hocus/blob/main/rfd/0002-worksp...
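
For reference, here's roughly what enabling discard looks like on the QEMU side: a minimal sketch in Python of a launcher that passes guest discards through to the backing image (the image path, IDs, and sizing here are made up):

    # Launch QEMU so that guest TRIM/discard requests actually shrink
    # the backing image on the host. All paths and IDs are illustrative.
    import subprocess

    subprocess.run([
        "qemu-system-x86_64", "-enable-kvm", "-m", "2048",
        # discard=unmap forwards guest discards to the image file;
        # detect-zeroes=unmap also turns zero writes into discards.
        "-drive", "file=workspace.qcow2,if=none,id=d0,"
                  "discard=unmap,detect-zeroes=unmap",
        "-device", "virtio-blk-pci,drive=d0",
    ], check=True)
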
24. zbroze+Yf[view] [source] [discussion] 2023-07-10 15:24:01
>>Muffin+Qc
I'd love to get a clear explanation of what libvirt actually does. As far as I can tell, it's a qemu argument assembler and launcher. For my own use case, I just launch qemu from systemd unit files:

https://wiki.archlinux.org/title/QEMU#With_systemd_service
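
For illustration, a minimal sketch of that approach: a Python script that renders a systemd unit which launches qemu directly, in the spirit of the Arch wiki page (the VM name, paths, and flags are all made up):

    # Render a systemd service that runs QEMU directly. Afterwards:
    # systemctl daemon-reload && systemctl start qemu-myvm
    import textwrap

    name, mem, disk = "myvm", "2048", "/var/lib/vms/myvm.qcow2"
    unit = textwrap.dedent(f"""\
        [Unit]
        Description=QEMU VM: {name}

        [Service]
        ExecStart=/usr/bin/qemu-system-x86_64 -name {name} -enable-kvm \\
            -machine q35 -m {mem} -nographic \\
            -drive file={disk},if=virtio,format=qcow2

        [Install]
        WantedBy=multi-user.target
        """)

    with open(f"/etc/systemd/system/qemu-{name}.service", "w") as f:
        f.write(unit)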

26. rgbren+ig[view] [source] 2023-07-10 15:25:21
>>hugodu+(OP)
"Firecracker's RAM footprint starts low, but once a workload inside allocates RAM, Firecracker will never return it to the host system."

Firecracker has a balloon device you can inflate (i.e., acquire as much memory inside the VM as possible) and then deflate, returning the memory to the host. You can do this while the VM is running.

https://github.com/firecracker-microvm/firecracker/blob/main...
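
For anyone curious what that looks like in practice, a minimal sketch of driving the balloon over Firecracker's API socket (the socket path is whatever you passed to --api-sock):

    # Inflate and deflate Firecracker's balloon at runtime via its
    # HTTP-over-unix-socket API. The socket path is illustrative.
    import http.client, json, socket

    class UnixHTTPConnection(http.client.HTTPConnection):
        def __init__(self, path):
            super().__init__("localhost")
            self.unix_path = path
        def connect(self):
            self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            self.sock.connect(self.unix_path)

    def api(method, endpoint, body):
        conn = UnixHTTPConnection("/tmp/firecracker.sock")
        conn.request(method, endpoint, json.dumps(body),
                     {"Content-Type": "application/json"})
        resp = conn.getresponse()
        assert resp.status < 300, resp.read()

    # Pre-boot: attach the balloon device.
    api("PUT", "/balloon", {"amount_mib": 0, "deflate_on_oom": True})
    # While running: inflate to 1 GiB (the guest hands pages back to
    # the host), then deflate so the guest can use its memory again.
    api("PATCH", "/balloon", {"amount_mib": 1024})
    api("PATCH", "/balloon", {"amount_mib": 0})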

35. yjftsj+6j[view] [source] [discussion] 2023-07-10 15:36:38
>>sheeps+vg
I'm pretty sure firecracker was literally created to underlie AWS Lambda.

EDIT: Okay, https://www.geekwire.com/2018/firecracker-amazon-web-service... says my "pretty sure" memory is in fact correct.

36. ec1096+sj[view] [source] [discussion] 2023-07-10 15:38:17
>>re-thc+n7
It's vulnerable to side-channel attacks, so be careful when enabling it: https://pve.proxmox.com/wiki/Kernel_Samepage_Merging_(KSM)
39. Izmaki+1l[view] [source] [discussion] 2023-07-10 15:45:42
>>arun-m+Le
I don't know if _one_ such article exists, but here is a piece of tech doc from oVirt (yet another tool) that shows how - or that - VDSM is used by oVirt to communicate with QEMU through libvirt: https://www.ovirt.org/develop/architecture/architecture.html...

In really simple terms, so simple that I'm not 100% sure they are correct:

* KVM is a hypervisor, or rather it lets you turn Linux into a hypervisor [1], which lets you run VMs on your machine. I've heard KVM is rather hard to work with (steep learning curve). (Xen is also a hypervisor.)

* QEMU is a wrapper-of-a-sorts (a "machine emulator and virtualizer" [2]) which can be used on top of KVM (or Xen). "When used as a virtualizer, QEMU achieves near native performance by executing the guest code directly on the host CPU. QEMU supports virtualization when executing under the Xen hypervisor or using the KVM kernel module in Linux." [2]

* libvirt "is a toolkit to manage virtualization platforms" [3] and is used, e.g., by VDSM to communicate with QEMU.

* virt-manager is "a desktop user interface for managing virtual machines through libvirt" [4]. The screenshots on the project page should give an idea of what its typical use-case is - think VirtualBox and similar solutions.

* Proxmox is the above toolstack (-ish) but as one product.

---

[1] https://www.redhat.com/en/topics/virtualization/what-is-KVM

[2] https://wiki.qemu.org/Main_Page

[3] https://libvirt.org/

[4] https://virt-manager.org/
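
To make the libvirt bullet concrete, a minimal sketch using the libvirt Python bindings (this assumes a domain named "demo" is already defined against the local QEMU/KVM driver; the name is hypothetical):

    # List domains and boot one through libvirt; libvirt assembles the
    # QEMU command line and launches/supervises the process for us.
    import libvirt

    conn = libvirt.open("qemu:///system")   # local QEMU/KVM driver
    for dom in conn.listAllDomains():
        state, maxmem, mem, vcpus, cputime = dom.info()
        print(dom.name(), "active" if dom.isActive() else "shut off")

    dom = conn.lookupByName("demo")         # hypothetical VM name
    if not dom.isActive():
        dom.create()                        # i.e. "virsh start demo"
    conn.close()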

43. heavys+cn[view] [source] 2023-07-10 15:55:23
>>hugodu+(OP)
Someone posted this and then immediately deleted their comment: https://qemu.readthedocs.io/en/latest/system/i386/microvm.ht...

I didn't know it existed until they posted, but QEMU has a Firecracker-inspired target:

> microvm is a machine type inspired by Firecracker and constructed after its machine model.

> It’s a minimalist machine type without PCI nor ACPI support, designed for short-lived guests. microvm also establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.
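
For the curious, a minimal sketch of booting it, closely following the example in that doc page: direct kernel boot, no PCI, devices on the virtio-mmio transport (the kernel and rootfs paths are hypothetical):

    # Boot a QEMU "microvm" guest. Since this machine type has no PCI
    # bus, devices use the virtio-mmio transport (virtio-blk-device
    # rather than virtio-blk-pci).
    import subprocess

    subprocess.run([
        "qemu-system-x86_64",
        "-M", "microvm",
        "-enable-kvm", "-cpu", "host", "-m", "512", "-smp", "1",
        "-kernel", "vmlinux",
        "-append", "console=ttyS0 root=/dev/vda rw",
        "-nodefaults", "-no-user-config", "-nographic",
        "-serial", "stdio",
        "-drive", "id=root,file=rootfs.img,format=raw,if=none",
        "-device", "virtio-blk-device,drive=root",
    ], check=True)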

46. bonzin+dp[view] [source] [discussion] 2023-07-10 16:06:54
>>zbroze+Yf
The most important point is that libvirt takes care of privilege separation.

It makes sure that if your VM and/or QEMU are broken out of, there are extra layers to prevent access to the whole physical machine. For example, it runs QEMU as a very limited user and, if you're using SELinux, the QEMU process can hardly read any file other than the VM image file.

By contrast, the method in the Arch wiki runs QEMU as root. QEMU is exposed to all sorts of untrusted input, so you really don't want it running as root.

Libvirt also handles cross-machine operations such as live migration, and makes it easier to query a bunch of things from QEMU.

For more info see https://www.redhat.com/en/blog/all-you-need-know-about-kvm-u...

47. shrubb+ep[view] [source] [discussion] 2023-07-10 16:06:54
>>anthk+Rd
Not precisely: KSM deduplicates after the fact, while with OpenVZ the sharing happens as a consequence of its design, when the program is loaded.

See (OpenVZ): "Containers share dynamic libraries, which greatly saves memory." It's just one Linux kernel when you are running OpenVZ containers.

https://docs.openvz.org/openvz_users_guide.webhelp/_openvz_c...

See (KVM/KSM): "KSM enables the kernel to examine two or more already running programs and compare their memory. If any memory regions or pages are identical, KSM reduces multiple identical memory pages to a single page. This page is then marked copy on write."

https://access.redhat.com/documentation/en-us/red_hat_enterp...

In KVM's defense, it supports a much wider range of OSes; OpenVZ only really does different versions of Linux, while KVM can run OpenBSD/FreeBSD/NetBSD/Windows and even OS/2 in addition to Linux.

49. hugodu+Ep[view] [source] [discussion] 2023-07-10 16:08:33
>>naikro+Mh
I didn't want to go into all the technical details, but we have another write-up that goes into detail about RAM management: https://github.com/hocus-dev/hocus/blob/main/rfd/0003-worksp...

Other than making sure we release unused memory to the host, we didn't customize QEMU that much. Although we do have a cool layered storage solution, basically a faster alternative to QCOW2 that's also VMM-independent. It's called overlaybd, and it was created and implemented at Alibaba. That will probably be another blog post. https://github.com/containerd/overlaybd

54. zokier+Vr[view] [source] [discussion] 2023-07-10 16:17:54
>>Muffin+Ol
There is a cute article from LWN demoing how to use KVM directly, without anything else: https://lwn.net/Articles/658511/
64. adql+tv[view] [source] [discussion] 2023-07-10 16:30:40
>>london+f4
https://www.kernel.org/doc/html/latest/admin-guide/mm/ksm.ht...

Zero-copy sharing is harder, as a system upgrade on one of them will trash it, but KSM is overall pretty effective at saving some memory across similar VMs.
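
For reference, the knobs from that doc live in sysfs; a minimal sketch of enabling KSM and checking what it saved (run as root; the tuning values are arbitrary):

    # Turn on KSM and read back its merge statistics via sysfs.
    KSM = "/sys/kernel/mm/ksm/"

    def write(name, value):
        with open(KSM + name, "w") as f:
            f.write(str(value))

    def read(name):
        with open(KSM + name) as f:
            return int(f.read())

    write("pages_to_scan", 1000)   # pages scanned per wake-up
    write("sleep_millisecs", 200)  # pause between scan batches
    write("run", 1)                # 1 = run, 0 = stop, 2 = unmerge all

    # Only regions an app marked MADV_MERGEABLE are eligible; QEMU does
    # this for guest RAM. pages_sharing vs. pages_shared is the payoff.
    print("shared:", read("pages_shared"),
          "sharing:", read("pages_sharing"))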

68. veber-+Bw[view] [source] [discussion] 2023-07-10 16:35:06
>>foundr+Nq
KVM is a type-1 hypervisor [1]

[1]: https://www.redhat.com/en/topics/virtualization/what-is-KVM

74. bonzin+IC[view] [source] [discussion] 2023-07-10 16:57:59
>>foundr+Nq
According to the actual paper that introduced the distinction, and adjusting for the change in terminology over the last 50 years, a type-1 hypervisor runs in kernel space and a type-2 hypervisor runs in user space. x86 is not virtualizable by a type-2 hypervisor, except by software emulation of the processor.

What actually can change is the amount of work that the kernel-mode hypervisor leaves to a less privileged (user space) component.

For more detail see https://www.spinics.net/lists/kvm/msg150882.html

75. monoca+sG[view] [source] [discussion] 2023-07-10 17:13:24
>>veber-+Bw
There are arguments in both directions for something like KVM. Wikipedia states it pretty well:

> The distinction between these two types is not always clear. For instance, KVM and bhyve are kernel modules[6] that effectively convert the host operating system to a type-1 hypervisor.[7] At the same time, since Linux distributions and FreeBSD are still general-purpose operating systems, with applications competing with each other for VM resources, KVM and bhyve can also be categorized as type-2 hypervisors.[8]

https://en.wikipedia.org/wiki/Hypervisor#Classification

76. bkettl+cI[view] [source] [discussion] 2023-07-10 17:18:45
>>yjftsj+6j
As does the paper [1] with details in section 4.1.

[1]: https://www.usenix.org/system/files/nsdi20-paper-agache.pdf

79. CompuI+IL[view] [source] 2023-07-10 17:29:59
>>hugodu+(OP)
At CodeSandbox we use Firecracker for hosting development environments, and I agree with the article's points, though I don't think they mean you shouldn't use Firecracker for long-lived workloads.

We reclaim memory with a memory balloon device; for disk trimming we discard (and compress) the disk; and for I/O speed we use io_uring (only on scratch disks, since the project disks are network disks).

It's a tradeoff. It's more work and does require custom implementations. For us that made sense, because in return we get a lightweight VMM that we can more easily extend with functionality like memory snapshotting and live VM cloning [1][2].

[1]: https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...

[2]: https://codesandbox.io/blog/cloning-microvms-using-userfault...

81. ushako+aT[view] [source] [discussion] 2023-07-10 17:52:28
>>nerpde+hA
It’s a common technique though. I believe it’s called oversubscription, where you rent the same hardware to multiple tenants hoping they won’t all use it at once.

Fly.io themselves admitted they’re oversubscribed, and AWS has been doing the same for years now.

Source: https://fly.io/blog/the-serverless-server/

89. cthalu+1a1[view] [source] [discussion] 2023-07-10 18:55:49
>>gwd+OZ
>Maybe it's because of the time I grew up in, but in my mind the prototypical Type-I hypervisor is VMWare ESX Server; and the prototypical Type-II hypervisor is VMWare Workstation.

My point is that these are largely appropriated terms - neither would fit the definitions of type 1 or type 2 from the early days when Popek and Goldberg were writing about them.

> Or does the thing at the bottom have to "play nice" with random other processes?

From this perspective, Xen doesn't count. You can have all sorts of issues from the dom0 side competing for resources. You mention PV drivers later, and you can 100% run into issues with VMs because of how dom0 schedules blkback and netback when competing with other processes.

ESXi can also run plenty of unmodified Linux binaries; go back in time 15 years and it's basically a fully featured OS. There's a lot running on it, too. Meanwhile, you can build a Linux kernel with plenty of things switched off and a root filesystem with just the bare essentials for managing KVM and QEMU that is even less useful for general-purpose computing than ESXi.

>Er, both KVM and Xen try to switch to paravirtualized interfaces as fast as possible, to minimize the emulation that QEMU has to do.

There are more things being emulated than there are PV drivers for, but this is a bit outside of my point.

For KVM, the vast majority of implementations are using qemu for managing their VirtIO devices as well - https://developer.ibm.com/articles/l-virtio/ - you'll notice that IBM even discusses these paravirtual drivers directly in context of "emulating" the device. Perhaps a better way to get the intent across here would be saying qemu handles the device model.

From a performance perspective, ideally you'd want to avoid PV here too and go with sr-iov devices or passthrough.

100. drbawb+lq1[view] [source] [discussion] 2023-07-10 20:11:05
>>hamand+zh
The second I read "shared block cache" my brain went to containers.

If you want data colocated on the same filesystem, then put it on the same filesystem. VMs suck; nobody spins up a whole fabricated IBM-compatible PC and gaslights their executable because they want to.[1] They do it because their OS (a) doesn't have containers, (b) doesn't provide strong enough isolation between containers, or (c) the host kernel can't run their workload (different ISA, different syscalls, different executable format, etc.).

Anyone who has ever tried to run heavyweight VMs atop a snapshotting volume already knows the idea of "shared blocks" is a fantasy; as soon as you do one large update inside the guest, the delta between your volume clones and the base snapshot grows immensely. That's why Docker et al. have a concept of layers and you describe your desired state as a series of idempotent instructions applied to those layers. That's possible because Docker operates semantically on a filesystem; it's much harder to do at the level of a block device.

Is a block containing b"hello, world" part of a program's text section, or part of a user's document? You don't know, because the guest is asking you for an LBA, not a path, not modes, not an ACL, etc. If you don't know that, the host kernel has no idea how the page should be mapped into memory. Furthermore, storing the information to dedup common blocks is non-trivial: go look at the man page for ZFS' deduplication and it is littered with warnings about the performance, memory, and storage implications of dealing with the dedup table.

[1]: https://www.youtube.com/watch?v=coFIEH3vXPw

101. reacto+pr1[view] [source] [discussion] 2023-07-10 20:16:19
>>bonzin+h61
>Using KVM, one can run multiple virtual machines running unmodified Linux or Windows images. Each virtual machine has private virtualized hardware: a network card, disk, graphics adapter, etc.

Straight from their site. QEMU is the user space interface, KVM the kernel space driver. It’s enough to run whatever OS. That’s the point.

For libvirt: https://libvirt.org/drivers.html

They support a bunch as well.

102. yokaze+Ns1[view] [source] [discussion] 2023-07-10 20:22:42
>>no_wiz+Cf1
Werner Vogels seems to disagree: https://twitter.com/Werner/status/25137574680
126. flamin+nd2[view] [source] [discussion] 2023-07-11 01:14:54
>>no_wiz+Pa
Generalized oversubscription like that is very challenging, if not impossible, to do securely, since you want to keep workloads isolated to single-tenant NUMA nodes.

E.g. using the firecracker jailer: https://github.com/firecracker-microvm/firecracker/blob/main...
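
For illustration, roughly what that pinning looks like under the jailer. A sketch only: the cgroup flags follow the jailer docs, but the exact flag names and CPU ranges vary by version and host topology, and every id, uid, and path here is made up:

    # Start Firecracker under the jailer, constrained to NUMA node 0 so
    # this tenant never shares a node with another tenant's VMs.
    import subprocess

    subprocess.run([
        "jailer",
        "--id", "tenant-a-vm0",
        "--exec-file", "/usr/bin/firecracker",
        "--uid", "10000", "--gid", "10000",
        "--cgroup", "cpuset.cpus=0-7",  # CPUs on node 0 (host-specific)
        "--cgroup", "cpuset.mems=0",    # allocate memory only from node 0
        "--",
        "--api-sock", "/run/firecracker.sock",
    ], check=True)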

128. alpb+Mj2[view] [source] 2023-07-11 02:18:51
>>hugodu+(OP)
Presumably this doesn't use the "microvm" machine type in QEMU? (also on front page right now >>36673945 )
129. VWWHFS+ok2[view] [source] [discussion] 2023-07-11 02:25:27
>>CompuI+IL
I don't know if this is relevant, but I've been intrigued by DragonflyBSD's "vkernel" [0] feature which (supposedly) allows for cloning the entire runtime state of the machine (established TCP connections, etc.) into a completely new userland memory space. I think they use it mostly for kernel debugging right now, but it's interesting to think about the possibilities of being able to just clone an entire running operating system to a new computer without interrupting even a single instruction.

[0] https://www.dragonflybsd.org/docs/handbook/vkernel/

132. kritr+6P2[view] [source] [discussion] 2023-07-11 07:37:19
>>london+f4
I believe we do this on Windows for Windows Sandbox. It works well, but you take a performance hit doing the block resolution compared to always paging into physical memory.

https://learn.microsoft.com/en-us/windows/security/applicati...

134. datade+6T2[view] [source] [discussion] 2023-07-11 08:10:36
>>mike_h+IE1
"Firecracker is an alternative to QEMU that is purpose-built for running serverless functions and containers safely and efficiently, and nothing more." [1]

Interesting. I guess we are reading a different website.

1. https://firecracker-microvm.github.io/

141. yjftsj+Odb[view] [source] [discussion] 2023-07-13 14:42:05
>>Muffin+Ol
> Can you use KVM/do KVM stuff without QEMU?

Here's a post of someone using KVM from Python (raw, without needing a kvm library or anything): https://www.devever.net/~hl/kvm
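
A minimal sketch in the same spirit, using nothing but the stdlib (the ioctl numbers come from <linux/kvm.h>):

    # Talk to /dev/kvm directly: check the API version, create a VM and
    # a vCPU. A real monitor would then mmap guest memory, register it
    # with KVM_SET_USER_MEMORY_REGION, load code, and loop on KVM_RUN.
    import fcntl, os

    KVM_GET_API_VERSION    = 0xAE00  # _IO(0xAE, 0x00)
    KVM_CREATE_VM          = 0xAE01  # _IO(0xAE, 0x01)
    KVM_GET_VCPU_MMAP_SIZE = 0xAE04  # _IO(0xAE, 0x04)
    KVM_CREATE_VCPU        = 0xAE41  # _IO(0xAE, 0x41)

    kvm = os.open("/dev/kvm", os.O_RDWR)
    assert fcntl.ioctl(kvm, KVM_GET_API_VERSION) == 12  # stable API

    vm = fcntl.ioctl(kvm, KVM_CREATE_VM, 0)     # fd for the new VM
    vcpu = fcntl.ioctl(vm, KVM_CREATE_VCPU, 0)  # vcpu index 0
    print("vm fd:", vm, "vcpu fd:", vcpu,
          "kvm_run size:", fcntl.ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE))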
