I recently came across a rather interesting issue that seems to be relatively unrecognised: since the 18xx updates, idle Windows guest VMs seem to consume about 30% of CPU on the Linux KVM host. This took me a little while to get to the bottom of, and after excluding the possibility of it being caused by any active processes inside the VM, I eventually pinned it down to the way system timers are used.
What seems to be happening is that the Windows kernel keeps polling the CPU timer at a rather aggressive rate, which manifests as rather high CPU usage on the host even though the guest is not doing any productive work. On a large virtualization server, this is obviously going to burn through a huge amount of CPU for no benefit.
The solution is to expose an emulated Hyper-V clock. For all other clocks, the kernel seems to incessantly poll the timers, but for Hyper-V it recognises that this is a bad idea in a virtual machine, and starts to behave in a more sensible way.
To achieve this, add this to your libvirt XML guest definition:
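The snippet itself is not reproduced in this copy of the post, so what follows is a minimal sketch of what such a definition typically looks like, assuming a libvirt version that supports the `hypervclock` timer. The `<timer name='hypervclock' present='yes'/>` element is the part that exposes the Hyper-V reference clock. The `<hyperv>` enlightenments under `<features>` are an assumption (they are commonly enabled alongside it, but are not mentioned above), and `offset='localtime'` is only a typical choice for Windows guests, so keep whatever offset your guest already uses.

```xml
<features>
  <hyperv>
    <!-- Assumed extras: commonly enabled Hyper-V enlightenments -->
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
  </hyperv>
</features>
<clock offset='localtime'>
  <!-- The key line: expose the emulated Hyper-V clock to the guest -->
  <timer name='hypervclock' present='yes'/>
</clock>
```

If `<features>` or `<clock>` already exist in the definition, merge these children into them rather than adding second copies. The definition can be edited with `virsh edit`, and the guest needs a full shutdown and start (not a reboot from inside the guest) for the change to take effect.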
If you are old enough to remember QEMM from back in the ’90s, along with other tools we used to squeeze every last byte of memory under the 640KB limit, you may remember a rather cool feature it had – warm reboot.
What is a Warm Reboot?
A reboot involves the computer doing a Power-On Self Test (POST). This takes time, often as much as a few minutes on some servers and workstations. While you are setting something up and need to test frequently that things come up correctly at boot time, the POST can make progress painfully slow. If only we had something like the warm reboot feature that QEMM had back in the ’90s, which allowed us to reset the RAM and reboot DOS without rebooting the entire machine and suffering the POST time. Well, such a thing does actually exist in modern Linux.
kexec allows us to do exactly this – load a new kernel, kill all processes, and hand over control to the new kernel as the bootloader does at boot time. What do we need for this magic to work? On a modern distro, not much, it is all already included. Let’s start with a script that I use and explain what each component does:
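The script is not reproduced in this copy of the post, so here is a sketch reconstructed from the description that follows rather than the original. Several details are assumptions: the `/boot/vmlinuz-…` and `/boot/initramfs-….img` paths follow the RHEL/CentOS naming convention, `--reuse-cmdline` and the `sleep 5` pause are guesses at sensible defaults, and the `run` helper with its `DRY_RUN` switch is a hypothetical addition so the sketch only prints the commands unless told otherwise.

```shell
#!/bin/bash
# Warm reboot sketch. By default this only prints the commands (dry run);
# run as root with DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

warm_reboot() {
    local kver
    kver="$(uname -r)"   # version of the currently running kernel

    # Stage the running kernel and its initramfs, reusing the current
    # kernel command line (paths assume RHEL/CentOS-style naming).
    run kexec -l "/boot/vmlinuz-${kver}" \
        --initrd="/boot/initramfs-${kver}.img" --reuse-cmdline

    # Stop the graphical session so the Nvidia modules can be unloaded.
    run systemctl isolate multi-user.target
    run sleep 5   # assumed grace period for Xorg to exit

    # Unload the dependent modules first, then the core nvidia module.
    run rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia

    # Hand control over to the staged kernel.
    run kexec -e
}

warm_reboot
```

On a machine without the Nvidia driver, the `systemctl isolate` and `rmmod` steps can simply be dropped.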
Let’s look at the kexec lines first. uname -r returns the current kernel version. The $(uname -r) bash syntax allows us to take the output of a command and use it as a string in the invoking command. On recent CentOS 8 here is what we get:
This is the minimum needed to boot the kernel. Once we have supplied this information, we initiate the shutdown and process purge, and hand over to the new kernel, using:
But what are the systemctl and rmmod lines about? They are mostly there to work around the finickiness of Nvidia drivers and GPUs. If you execute kexec immediately, with the Nvidia driver still loaded, the GPU won’t reset properly and won’t get properly re-initialised by the driver when the kernel warm-boots. So we have to rmmod the nvidia driver. The legacy nvidia driver only includes the nvidia module. Newer versions also include nvidia_drm, nvidia_modeset and nvidia_uvm, which depend on the nvidia module, so we have to remove those first. But before we do that, we have to make sure that Xorg isn’t running, otherwise we won’t be able to unload the nvidia driver. To make sure the graphical environment isn’t running, we switch the systemd target to multi-user.target (on a workstation we are probably running graphical.target by default). Once Xorg is no longer running, we can proceed with unloading the nvidia driver modules. And with that done, we can proceed with the warm boot and enjoy a reboot time saving.
People always seem very shocked when I suggest that virtualization comes with a very substantial performance penalty even when virtualization hardware extensions are used. Concerningly, this surprise often comes from people who have already either committed their organization’s IT infrastructure to virtualization, or have made firm plans to do so. The only thing I can conclude in these cases, unbelievable as it may appear, is that they haven’t done any performance testing of their own to assess the solution they are planning to adopt.
So I decided to document some basic performance tests that show just how substantial the performance hit of virtualization is.
Hardware: Core2 Quad 3.2GHz; 8GB of RAM; 2x500GB 7200rpm SATA in DM RAID1 for the main system; 1x250GB 7200rpm SATA for testing.
Virtual Test Configuration (VMware Player 4.0.4, Xen 4.1.2 (PV and HVM), KVM (RHEL6), VirtualBox 4.1.18): CPU Cores: 4 (all); RAM: 6GB; Disk: system booting off the 2x500GB RAID1, with the raw 250GB SATA disk passed to the VM.
Disk write caching was enabled in the VMware configuration. You may think that this unfairly gives the VM configuration an advantage, but as you will see from the results, even with this “cheat”, the performance is still very disappointing compared to bare metal. In any case, the amount of disk I/O is negligible – the caches and the working set always fit into memory.
Physical Test Configuration: CPU Cores: 4 (all); RAM: 6GB (limited using the mem=6G boot parameter); Disk: booting directly off the same 250GB SATA disk used for VM testing, with the same kernel and configuration.
The test performed is the compile of the vanilla 22.214.171.124 Linux kernel. This is the script used for testing:
make clean > /dev/null 2>&1
make mrproper > /dev/null 2>&1
echo 3 > /proc/sys/vm/drop_caches
make allmodconfig > /dev/null 2>&1
find . -type f -print0 | xargs --null cat > /dev/null
echo "Timing build..."
time (make -j16 all > /dev/null 2>&1)
The source tree is cleaned and all caches dropped. The allmodconfig configuration is used to get some degree of testing of disk I/O by creating the maximum number of files. Caches are then primed by pre-loading all the source files. This is done in order to more accurately measure the CPU and RAM subsystems without bottlenecking on disk I/O. The CPU in the system has 4 cores, and 16 build threads are used to ensure the CPU and memory I/O are saturated, but without causing enough memory pressure to cause swapping.
On the host and in the guest, all unnecessary services and processes were stopped (especially crond which could theoretically cause additional load on the system that would distort the results).
All tests were carried out 3 times in a row, and the best result for each is considered here (the differences between the runs were minimal).
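The repeat-and-take-the-best procedure can be sketched as a small wrapper. This is an illustration rather than the harness actually used: `best_of` is a hypothetical helper, it relies on GNU `date` for nanosecond timestamps, and it measures wall-clock time only.

```shell
#!/bin/bash
# Run a command several times and print the fastest wall-clock time in seconds.
# Usage: best_of RUNS COMMAND [ARGS...]
best_of() {
    local runs="$1"; shift
    local best="" start end elapsed i
    for ((i = 0; i < runs; i++)); do
        start="$(date +%s%N)"     # nanoseconds since the epoch (GNU date)
        "$@" > /dev/null 2>&1
        end="$(date +%s%N)"
        elapsed=$(( end - start ))
        # Keep the smallest elapsed time seen so far.
        if [ -z "$best" ] || [ "$elapsed" -lt "$best" ]; then
            best="$elapsed"
        fi
    done
    # Convert nanoseconds to seconds with millisecond precision.
    awk -v ns="$best" 'BEGIN { printf "%.3f\n", ns / 1e9 }'
}
```

For example, `best_of 3 make -j16 all` would time three builds, although the cleaning and cache priming steps from the test script would still need to run before each one.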
This is very much a redneck, brute-force test. There isn’t much finesse to it. But I like tests like this because they cannot be cheated with the sort of smoke and mirrors illusions that virtualization software is very good at applying.
Xen 4.1.2 (PV):
VMware ESXi 5.0.0:
VMware Player 5.0.0:
VMware Player 4.0.4:
Xen 4.1.2 (HVM):
Note: No, this is not a typo – VirtualBox really is that bad.
To make this difference easier to visualise, here it is on graphs
To give a better idea of relative performance, here it is in % points, with bare metal being 100%.
The difference is substantial even with the least poorly performing hypervisor. Performance is over a fifth (21%) down on bare metal with paravirtualized Xen, nearly a quarter (24%) down with VMware ESXi, and even worse with KVM. Or if you prefer to look at it the other way around, bare metal is more than a quarter as fast again (26.32%) as the best performing hypervisor on the same hardware.
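The two ways of stating the gap are not the same number, which is worth spelling out: a guest running at about 79% of bare-metal speed is 21% down, but bare metal is then 1/0.79, or roughly 26%, faster than the guest. The sketch below shows the arithmetic, assuming performance is inversely proportional to build time. The function names are hypothetical, and the 95 s and 120 s timings are made-up inputs chosen to land near the figures above, not the measured results.

```shell
#!/bin/bash
# Guest performance as a percentage of bare metal, from build times in seconds.
# Assumes performance is inversely proportional to wall-clock build time.
relative_perf() {    # relative_perf BARE_SECONDS GUEST_SECONDS
    awk -v bare="$1" -v guest="$2" 'BEGIN { printf "%.2f\n", 100 * bare / guest }'
}

# Extra hardware (in percent) needed for guests to match bare-metal throughput.
extra_hardware() {   # extra_hardware BARE_SECONDS GUEST_SECONDS
    awk -v bare="$1" -v guest="$2" 'BEGIN { printf "%.2f\n", 100 * (guest / bare - 1) }'
}

# Hypothetical timings, NOT the measured results: 95 s bare metal, 120 s guest.
relative_perf 95 120     # prints 79.17
extra_hardware 95 120    # prints 26.32
```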
Don’t get me wrong – virtualization is handy for all sorts of low-performance tasks. In cases where it is used to consolidate a number of mostly idle systems into one mostly idle system, it brings clear benefits. (Except maybe in the case of VirtualBox – the performance there is just too appalling for anything, and HVM Xen is pretty poor, too.) But for uses where performance is important, thoughts of virtualizing need to undergo a serious reality check. Even if your system is designed to scale completely horizontally, requiring 26%+ of extra hardware (best case scenario, it could be a lot worse depending on which hypervisor you use) is likely to put a significant strain on your budget and running costs.
Note: It is worth stressing that these tests are carried out on hardware with VT-x, and support for this is enabled and used for all the tested hypervisors. So the results here are based on optimal hardware support.