Virtualized Windows 10 Idle CPU Consumption

I recently came across a rather interesting issue that seems to be relatively unrecognised – since the 18xx updates, idle Windows guest VMs seem to be consuming about 30% of CPU on the Linux KVM host. This took me a little while to get to the bottom of, and after excluding the possibility of it being caused by any active processes inside the VM, I eventually pinned it down to the way system timers are used.

Diagnosis

What seems to be happening is that the Windows kernel keeps polling the CPU timer at a rather aggressive rate, which manifests as high CPU usage on the host even though the guest is not doing any productive work. On a large virtualization server, this is obviously going to burn through a huge amount of CPU for no benefit.

Solution

The solution is to expose an emulated Hyper-V clock. For all other clocks, the kernel seems to incessantly poll the timers, but for Hyper-V it recognises that this is a bad idea in a virtual machine, and starts to behave in a more sensible way.

To achieve this, add this to your libvirt XML guest definition:

<features>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <synic state='on'/>
      <stimer state='on'/>
    </hyperv>
</features>

This gets the VM’s idle CPU usage from 30%+ down to a much more reasonable 1%.
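
To confirm that a guest definition actually carries these settings, a quick check along the following lines can help. This is only a minimal sketch and not part of the fix itself: the function name check_hyperv_features is made up for illustration, and all it does is grep the XML on standard input for the five elements listed above.

```shell
#!/bin/bash

# Hypothetical helper: reads a libvirt domain XML on stdin and reports any
# of the Hyper-V features from the snippet above that are not switched on.
check_hyperv_features() {
    local xml feature missing=0
    xml=$(cat)
    for feature in relaxed vapic spinlocks synic stimer; do
        # Match e.g. <stimer state='on'/> with either quote style.
        if ! grep -Eq "<${feature}[[:space:]]+state=['\"]on['\"]" <<<"$xml"; then
            echo "missing or disabled: ${feature}"
            missing=1
        fi
    done
    return "$missing"
}
```

It can be fed from virsh, for example virsh dumpxml win10 | check_hyperv_features, where win10 is a placeholder for your own guest name.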

Warm Reboot on Linux with kexec (Remember QEMM?)

If you are old enough to remember QEMM from back in the ’90s, along with other tools we used to squeeze every last byte of memory under the 640KB limit, you may remember a rather cool feature it had – warm reboot.

What is a Warm Reboot?

A reboot involves the computer doing a Power-On Self Test (POST). This takes time, often as much as a few minutes on some servers and workstations. While you are setting something up and need to test frequently that things come up correctly at boot time, the POST can make progress painfully slow. If only we had something like the warm reboot feature that QEMM had back in the ’90s, which allowed us to reset the RAM and reboot DOS without rebooting the entire machine and suffering the POST time. Well, such a thing does actually exist in modern Linux.

Enter kexec

kexec allows us to do exactly this – load a new kernel, kill all processes, and hand over control to the new kernel as the bootloader does at boot time. What do we need for this magic to work? On a modern distro, not much; it is all already included. Let’s start with a script that I use and explain what each component does:

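#!/bin/bash

# Stop the graphical session so that nothing holds the GPU open
systemctl isolate multi-user.target

# Unload the dependent Nvidia modules first, then the core driver
rmmod nvidia_drm nvidia_modeset nvidia_uvm
rmmod nvidia

# Stage the currently running kernel, its initramfs and its boot parameters
kexec --load /boot/vmlinuz-$(uname -r) \
      --initrd=/boot/initramfs-$(uname -r).img \
      --command-line="$(cat /proc/cmdline)"

# Hand over control to the staged kernel
kexec --exec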

Let’s look at the kexec lines first. uname -r returns the current kernel version. The $(uname -r) bash syntax allows us to take the output of a command and use it as a string in the invoking command. On a recent CentOS 8, here is what we get:

$ uname -r
4.18.0-193.6.3.el8_2.centos.plus.x86_64
$ echo $(uname -r)
4.18.0-193.6.3.el8_2.centos.plus.x86_64

The kernel and initial ramdisk usually have the kernel version in their names in /boot/:

$ ls /boot/
initramfs-4.18.0-193.6.3.el8_2.centos.plus.x86_64.img
vmlinuz-4.18.0-193.6.3.el8_2.centos.plus.x86_64

So in our warm reboot script, vmlinuz-$(uname -r) will expand to vmlinuz-4.18.0-193.6.3.el8_2.centos.plus.x86_64. The same happens with the initramfs file name.
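
The initramfs-<version>.img name is what CentOS and Fedora use; Debian and Ubuntu name the same file initrd.img-<version>. If you want the script to work on both, something like the following sketch could pick whichever exists. The find_initrd function is my own illustrative helper, not anything that ships with kexec, and the second argument exists only so that it can be pointed at a scratch directory.

```shell
#!/bin/bash

# Hypothetical helper: print the first existing initial ramdisk for a given
# kernel version, trying the CentOS/Fedora name first and then the
# Debian/Ubuntu name.
find_initrd() {
    local version=$1 bootdir=${2:-/boot} candidate
    for candidate in \
        "$bootdir/initramfs-$version.img" \
        "$bootdir/initrd.img-$version"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no initrd found for $version in $bootdir" >&2
    return 1
}
```

In the warm reboot script this would replace the hard-coded path, as in --initrd="$(find_initrd "$(uname -r)")".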

Next, what is in /proc/cmdline? This contains the boot parameters that our currently running kernel was booted with, as provided in our grub configuration, for example:

$ cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/vmlinuz-4.18.0-193.6.3.el8_2.centos.plus.x86_64 root=ZFS=tank/ROOT quiet elevator=deadline transparent_hugepage=never

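This is the minimum needed to boot the kernel. Once we have supplied this information, we hand over to the new kernel. Bear in mind that this jumps straight into the new kernel without running the normal shutdown sequence (systemctl kexec is the orderly alternative), so make sure nothing important is still running. The handover is done using:

kexec --exec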

But what are the systemctl and rmmod lines about? They are mostly there to work around the finickiness of Nvidia drivers and GPUs. If you execute kexec immediately, with the Nvidia driver still loaded, the GPU won’t reset properly and won’t get properly re-initialised by the driver when the kernel warm-boots. So we have to rmmod the nvidia driver. The legacy nvidia driver only includes the nvidia module. Newer versions also include nvidia_drm, nvidia_modeset and nvidia_uvm, which depend on the nvidia module, so we have to remove those first. But before we do that, we have to make sure that Xorg isn’t running, otherwise we won’t be able to unload the nvidia driver. To make sure the graphical environment isn’t running, we switch the systemd target to multi-user.target (on a workstation we are probably running graphical.target by default). Once Xorg is no longer running, we can proceed with unloading the nvidia driver modules. And with that done, we can proceed with the warm boot and enjoy the reboot time saving.
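
Since the set of nvidia modules differs between driver versions, one possible refinement is to unload only the ones that are actually loaded. The sketch below does this by reading /proc/modules; module_loaded and unload_nvidia are names I have made up for illustration, and the file argument is only there so the check can be tried against a sample file.

```shell
#!/bin/bash

# Hypothetical helper: succeed if the named module appears in the module
# list, which has one module per line with the name in the first field.
module_loaded() {
    local name=$1 modules_file=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Unload whichever nvidia modules are present: dependants first, core last.
unload_nvidia() {
    local mod
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        if module_loaded "$mod"; then
            rmmod "$mod" || return 1
        fi
    done
}
```

Calling unload_nvidia in place of the two rmmod lines means that a failure to unload stops the script before kexec runs, rather than warm-booting with the driver still attached to the GPU.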

Hardware Accelerated SSL on SheevaPlug (Marvell Kirkwood ARM) Using OpenSSL on Fedora

I have recently been spending quite a lot of time working on Linux on various ARM devices. It is quite amazing what ARM hardware is capable of nowadays. One of the most popular ARM based machines available is the SheevaPlug. The performance of it is pretty good for a small server – my experience shows that the 1.2GHz Marvell Kirkwood 88F6281 compares quite favourably to the likes of the 1.66GHz Intel Atom N450 in terms of both server performance and especially power usage. Atom N450 systems have a typical power draw of about 22W idle and 28W under load – a far cry from the supposed 7.6W total of 5.5W N450 + 2.1W NM10. The SheevaPlug, on the other hand, draws 2.3W idle and 7W under load.

In some areas, however, the Atom does hold a performance advantage, especially in usage that requires heavy number crunching – unlike the Marvell Kirkwood, the Atom N450 has an FPU and SIMD capability via the SSE/SSE2/SSSE3 instruction sets. Among the applications that get better performance on the Atom N450 are those doing encryption, for example OpenSSL. Or do they…

Not quite. The Kirkwood ARM has an ace up its sleeve, and as it turns out, it is one powerful enough to allow it to close the gap against a processor with 4x the power budget. It has a hardware crypto engine that supports MD5, SHA1 and AES-128 acceleration.

Unfortunately, mainstream Linux distributions don’t come with the hardware crypto acceleration enabled, and most of the documentation available is sufficiently out of date to be inapplicable to the current generation of distributions. All of it points at OCF Linux, which hasn’t been updated for kernels past 2.6.33 and OpenSSL 0.9.8n, both of which are deprecated. I have modified the kernel patches to make them work on 2.6.35, but unfortunately the cryptodev driver uses the locked ioctl operation, which has been removed from the kernel starting with 2.6.36, so further modifications are required to make it work on later kernels. OCF Linux also doesn’t appear to have been updated since late 2010. But things are not as bad as they initially seem – it turns out that there is an alternative.

Kernel patches are required because acceleration depends on the BSD style cryptodev kernel interface. There is an alternative, more up to date project that provides this much less intrusively: Cryptodev-linux. It provides a standalone driver that doesn’t require the entire kernel to be recompiled for it, and it works with the 2.6.36+ kernels.

That just leaves OpenSSL support. Well, it turns out that OpenSSL 1.0.0 already comes with support for cryptodev hardware offload; it just isn’t enabled by default. It has to be enabled during the configure stage by providing -DHAVE_CRYPTODEV (for encryption offload) and -DUSE_CRYPTODEV_DIGESTS (for hashing offload). If you are building against Cryptodev-linux you will also have to provide the -DHASH_MAX_LEN=64 parameter – this is normally in OCF’s cryptodev.h header file, but isn’t present in the header files that Cryptodev-linux provides. Not a big deal, but something to bear in mind when you are building your own OpenSSL with cryptodev engine support.

So, how big a difference does the Kirkwood’s acceleration make? Quite a substantial one. Here is what the openssl speed test produces:

Kirkwood without cryptodev:
# openssl speed -evp aes-128-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 1870065 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 516074 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 256 size blocks: 132474 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 33342 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 4171 aes-128 cbc's in 3.00s

Kirkwood with cryptodev:
# openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 85277 aes-128-cbc's in 0.08s
Doing aes-128-cbc for 3s on 64 size blocks: 82960 aes-128-cbc's in 0.08s
Doing aes-128-cbc for 3s on 256 size blocks: 59806 aes-128-cbc's in 0.03s
Doing aes-128-cbc for 3s on 1024 size blocks: 40939 aes-128-cbc's in 0.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 8227 aes-128-cbc's in 0.00s

The results show, predictably, that with very small (unrealistically small) data blocks, software-only userspace crypto is faster due to less context switching. With 1KB blocks, however, hardware crypto is 23% faster, and with 8KB blocks the hardware engine goes twice as fast as the software-only option. But what is really impressive is the reduction in CPU time. Because the hardware crypto engine is asynchronous, there is practically no CPU time required when using it, which is important since it leaves the CPU free to get on with other tasks.
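
For anyone who wants to check the percentages, here is the arithmetic as a small sketch. The pct_faster function is just an illustrative name; the block counts are the 1KB and 8KB figures from the two listings above, both taken over the same 3 second run.

```shell
#!/bin/bash

# Percentage by which count $1 exceeds count $2, rounded to a whole number.
pct_faster() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a / b - 1) * 100 }'
}

# Hardware offload count first, software-only count second.
echo "1KB: hardware is $(pct_faster 40939 33342)% faster"   # prints 23%
echo "8KB: hardware is $(pct_faster 8227 4171)% faster"     # prints 97%
```

The 8KB figure of 97% is what "twice as fast" is rounding up from.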

For comparison, here are the Atom N450 results:

# openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 3813930 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 1098375 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 256 size blocks: 294884 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 1024 size blocks: 74520 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 9245 aes-128-cbc's in 2.99s

So the Atom is faster all around – on 1KB blocks it is 82% faster, which reduces to a 12% advantage using 8KB blocks. But let us not forget that we could, in theory, run two instances of OpenSSL, one with hardware offload and one without, which would give us the combined total performance of both, if that is all we needed the machine to do. This would give us figures of approximately:

1KB: 33342+40939=74281
8KB: 4171+8227=12398

This ties with the Atom using 1KB blocks, and beats it by 34% using 8KB blocks – all in a power envelope 4x smaller. Pretty impressive.
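
The combined figures can be reproduced with the following sketch, which adds the Kirkwood software-only and hardware-offload counts and compares the totals with the Atom counts from the listings above. The ratio function is an illustrative helper of my own, and the sum is of course the theoretical best case described above rather than a measured result.

```shell
#!/bin/bash

# Ratio of two block counts, to two decimal places.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Kirkwood software-only plus hardware-offload counts.
kirkwood_1k=$((33342 + 40939))
kirkwood_8k=$((4171 + 8227))

echo "1KB: Kirkwood combined $kirkwood_1k vs Atom 74520, ratio $(ratio "$kirkwood_1k" 74520)"
echo "8KB: Kirkwood combined $kirkwood_8k vs Atom 9245, ratio $(ratio "$kirkwood_8k" 9245)"
```

A ratio of 1.00 at 1KB is the tie, and 1.34 at 8KB is the 34% lead.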

Installing Cryptodev-linux is trivial: it is just the usual “make; make install” procedure after extracting the tarball (make sure you have the kernel headers for your kernel installed and available in /lib/modules/$(uname -r)/build/).
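
Spelled out as a script, the procedure might look roughly like this. It is a sketch under a few assumptions: have_kernel_headers and install_cryptodev are names I made up, the source tree is assumed to be already extracted, and the final check assumes that the loaded module shows up as the /dev/crypto device.

```shell
#!/bin/bash

# Hypothetical pre-flight check: the module build needs the headers for the
# running kernel under /lib/modules/<version>/build. The modules root is a
# parameter only so the check can be tried against a scratch directory.
have_kernel_headers() {
    local version=${1:-$(uname -r)} root=${2:-/lib/modules}
    [ -d "$root/$version/build" ]
}

# Build, install and load the driver from an extracted source tree.
# Needs to run as root for the install and modprobe steps.
install_cryptodev() {
    local srcdir=$1
    have_kernel_headers || { echo "kernel headers not found" >&2; return 1; }
    make -C "$srcdir" && make -C "$srcdir" install || return 1
    depmod -a
    modprobe cryptodev
    # Assumption: the driver exposes itself as the /dev/crypto device.
    [ -c /dev/crypto ]
}
```

It would be invoked as install_cryptodev followed by the path to the extracted source directory.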

I mentioned above the additional parameters required to make OpenSSL build with cryptodev support. In Fedora 13’s OpenSSL source package, you can edit the relevant line in the spec file. On my version it reads:

./Configure --prefix=/usr --openssldir=%{_sysconfdir}/pki/tls ${sslflags} zlib enable-camellia enable-seed enable-tlsext enable-rfc3779 enable-cms enable-md2 no-idea no-mdc2 no-rc5 no-ec no-ecdh no-ecdsa --with-krb5-flavor=MIT --enginesdir=%{_libdir}/openssl/engines --with-krb5-dir=/usr -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -DHASH_MAX_LEN=64 shared threads ${sslarch} fips
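
Once the rebuilt package is installed, one way to see whether the cryptodev support made it in is to look for the engine in the output of openssl engine, which lists one engine per line with its id in brackets at the start. The sketch below wraps that check; has_cryptodev_engine is an illustrative name, and it assumes the engine registers itself under the id cryptodev.

```shell
#!/bin/bash

# Hypothetical check: succeeds if the engine list on stdin includes an
# engine whose id is "cryptodev".
has_cryptodev_engine() {
    grep -q '^(cryptodev)'
}
```

Typical use would be openssl engine | has_cryptodev_engine && echo "cryptodev engine available".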

In case you cannot modify/build it yourself, here are the packages:
https://altechnative.net/wp-content/uploads/2011/05/openssl-1.0.0-1.kw.fc13.src.rpm
https://altechnative.net/wp-content/uploads/2011/05/openssl-1.0.0-1.kw.fc13.armv5tel.rpm
https://altechnative.net/wp-content/uploads/2011/05/openssl-devel-1.0.0-1.kw.fc13.armv5tel.rpm