EVGA SR-2 – Long Term Review

Having used EVGA’s one-time flagship, and possibly its most hyped motherboard ever, for the past two and a half years, and having fought its many bugs and quirks across uses it was in theory capable of but clearly never tested against, it seemed like a good idea to collate all the issues and workarounds into a single article. These findings have been cross-checked against multiple SR-2 motherboards.

Hardware / BIOS / POST

While there are various minor, annoying bugs in the BIOS itself, I will not go into the details of those and will instead focus on the issues of real practical importance.

96GB of RAM

The Xeon X5xxx series CPU specification states that each CPU is capable of addressing 192GB of RAM. Unfortunately, the EVGA SR-2 specification only states that the board can handle up to 48GB of RAM. This is more than a little disappointing, but there is a way to persuade it to complete the POST with 96GB. You will need 12 8GB x4 dual-ranked registered DDR3 DIMMs. Insert 6 of them into the red memory slots, and boot up. Set the following:

  • MCH strap: 1600MHz
  • Memory speed: 1333
  • Manually set all the memory timings to what they were auto-detected to be
  • Set the command rate to 2T
  • No voltage increases are required just because you have 96GB – if your DIMMs are rated at 1.35V, then there is no need to set DIMM voltages higher than 1.35V.

Insert the remaining 6 DIMMs and it should now be able to boot with 96GB. The POST may take 2-3 cycles to complete, but within 30 seconds or so you should see the BIOS splash screen. Once it has booted up, a soft reboot will complete without delay. It only takes a little while on a cold boot.

Don’t expect 96GB to POST at much over 167MHz BCLK.

Unfortunately, more than 96GB will not work.

Watch out for SpeedStep Side Effects

If you enable SpeedStep but disable TurboBoost, the CPU will still boost to +1 multiplier. This is not intuitive and can cause you problems during stability testing.

Clock Generator Stability

Above 180MHz BCLK, expect to see very noisy clock signals. If you watch the clock speeds on a monitoring application, you will notice that the clock speeds will regularly spike very high and very low. This means that the stability above 180MHz BCLK is not going to be appropriate for any serious use.

Virtualization With VT-d / IOMMU

All the PCIe slots on the SR-2 are behind Nvidia NF200 PCIe bridges. Unfortunately, these have a bug in that they do not route all DMA via the upstream root PCIe hub. The consequence is that when a virtual machine with PCI passthrough accesses memory at an address within its virtual sandbox that overlaps the physical range of a PCI IOMEM area mapped to any physical device, the access is routed to the physical device rather than remapped out of the way. At best, this results in a host crash when a physical card crashes and takes the PCIe bus down with it. At worst, the memory access tramples the region mapped to a disk controller, which can easily result in garbage being written to disk – and then the host will crash anyway.

The workaround is to make sure, by whatever means are available, that the virtual machine does not access the area between 1GB and 4GB, which is the area reserved for mapping PCI I/O memory. Two years ago the only solution available to me was to write a patch for Xen’s hvmloader that marked that entire memory area as reserved. In theory you could also tell your guest OS to simply not use that memory (e.g. using bcdedit in Windows 7 and later to mark the area as badmem, or using mem= parameters to the Linux kernel). Today, with the latest version of QEMU for Xen and KVM, you can instead use the max-ram-below-4g=1G parameter to the -machine option, which achieves the same thing much more cleanly and with no ill side effects (such as 3GB of RAM going missing in the guest).
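As a rough illustration of the QEMU route, the sketch below only assembles and prints the relevant -machine argument; it does not start a VM. The helper name machine_arg is made up for this example, and the machine type and 16G memory size are simply the values used elsewhere in these articles, so adjust them for your own setup.

```shell
#!/bin/sh
# Hypothetical helper: build the -machine argument that keeps guest RAM
# out of the 1GB-4GB range used for PCI I/O memory on the host.
# $1 = QEMU machine type, $2 = amount of RAM allowed below 4GB
machine_arg() {
    printf '%s,accel=kvm,max-ram-below-4g=%s\n' "$1" "$2"
}

# Print the command line rather than running it, so it can be reviewed first.
echo qemu-system-x86_64 -machine "$(machine_arg pc-i440fx-2.2 1G)" -cpu host -m 16G
```

Guest RAM that does not fit below the limit is placed above 4GB instead, which is why nothing goes missing in the guest.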

Note that even with this workaround, there will still be weird seemingly DMA related crashes on the SR-2 when you have VT-d enabled and you use SAS controllers. For some reason this motherboard really does not play well with them (tried three different generations of LSI, an Adaptec and a 3Ware). Some controllers will simply have no disks show up when you boot the kernel with intel_iommu=on (older LSI, Adaptec), others will seem to work but randomly crash when a VM with PCI passthrough is running (3Ware). Simple SATA controllers do not seem to suffer from this problem.

Marvell 88SE9123 SATA-3 6 GBit controller

This may nominally be a 6GBit/s SATA controller, but you should be aware that its physical upstream connection is a single PCIe 2.0 lane, with a raw line rate of 5GBit/s. That means the maximum throughput you can possibly get from both of these SATA ports (the red ones on the board) combined is about 450-500MB/s. This is something to bear in mind if you are planning to connect a pair of SSDs. You will achieve higher overall throughput by connecting the second SSD to the ICH10 SATA-2 controller (the black ports on the board), even though the latter only supports up to 3GBit/s.
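The arithmetic behind that recommendation can be sketched as follows. It assumes 8b/10b line encoding on both PCIe 2.0 and SATA (10 bits on the wire per byte of payload), ignores all other protocol overhead, and assumes SSDs fast enough to saturate whatever link they are on, so the figures are theoretical ceilings rather than measurements.

```shell
#!/bin/sh
# Payload bandwidth in MB/s for a link with 8b/10b encoding.
# $1 = raw line rate in Mbit/s
payload_mbps() {
    echo $(( $1 * 8 / 10 / 8 ))
}

# Smaller of two integers
min() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

marvell_uplink=$(payload_mbps 5000)   # PCIe 2.0 x1 behind the 88SE9123
sata3_port=$(payload_mbps 6000)       # one SATA-3 port on the Marvell
ich10_port=$(payload_mbps 3000)       # one SATA-2 port on the ICH10

# Both SSDs on the Marvell ports share the single PCIe lane.
both_on_marvell=$(min $(( 2 * sata3_port )) "$marvell_uplink")

# One SSD on the Marvell (capped by its uplink) plus one on the ICH10.
split=$(( $(min "$sata3_port" "$marvell_uplink") + ich10_port ))

echo "Both SSDs on Marvell: ${both_on_marvell} MB/s combined"
echo "One on Marvell, one on ICH10: ${split} MB/s combined"
```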

Overclocking with Westmere Xeons

The settings I have used with great success for the past 2.5 years, in addition to those mentioned above for operation with 96GB of RAM, are:

  • CPU Core Voltage: 1.300V. This is sufficient for up to 4GHz. You may need to go as far as 1.350V for 4.15GHz, but beyond that no voltage increase will keep things stable.
  • VTT Voltage: 1.325V. This is sufficient up to about 3.33GHz uncore speeds, which is about as far as you can realistically expect to get out of Westmere Xeons. Do not under any circumstances push this past 1.350V as it is almost guaranteed to damage the CPU regardless of how good the cooling is.
  • BCLK: <= 180MHz. My experience is that this is as far as you can go before clock frequencies start to spike all over the place. In the interest of stability, I would recommend not exceeding 177MHz, as this is where 4.8GT/s QPI setting actually equals 6.4GT/s that all the components are rated at – and there seems to be almost no headroom at all for QPI overclocking on components of this generation.
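To see where the 177MHz figure comes from, here is a small sketch of the QPI arithmetic. It assumes the BIOS "4.8GT/s" setting is a fixed x36 multiplier of BCLK (4800 divided by the stock 133.33MHz BCLK); the function name is made up for the example.

```shell
#!/bin/sh
# Effective QPI rate in MT/s for the "4.8GT/s" BIOS setting.
# Assumption: that setting is a x36 multiplier applied to BCLK.
# $1 = BCLK in MHz (integer)
qpi_rate() {
    echo $(( $1 * 36 ))
}

for bclk in 133 166 177 180; do
    echo "BCLK ${bclk}MHz -> QPI $(qpi_rate "$bclk") MT/s"
done
```

At 177MHz the nominal 4.8GT/s setting already lands within a whisker of the 6.4GT/s rating, and 180MHz goes past it.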

Motherboard Heatsink Fan

As far as I have been able to establish, this only seems to make any appreciable difference in cases of combined extreme BCLK overclocking, IOH over-volting, and using most if not all of the 64 PCIe lanes available through the PCIe slots. In more typical use (two PCIe x16 GPUs, 166MHz BCLK, relatively low 1.250V on the IOH), the difference between the fan being full on (approx. 5000 rpm) and completely off is around 9C (46C fully on, 55C completely off). Consequently, it may be preferable in some cases to remove the aluminium duct plate surrounding the fan, disconnect the fan, and leave the heatsink to passively cool the Intel 5520 I/O Hub, Intel ICH10 South Bridge, and Nvidia NF200 PCIe bridges. The airflow through the case caused by the case fans is likely to be more than sufficient in most if not all installations. Removing the fan and duct plate also avoids the sometimes extreme yet invisible dust build-up in the heatsink fins under the duct plate, which can push temperatures higher than they would be with no active fan or duct plate at all.

Linux

Hot-plug Flapping

This will show up as soon as you start the installer for any distribution you choose. You will receive a flood of messages on the console, which will make the system grind to a halt. The cause is an on-board device that is erroneously marked as hot-pluggable, which ASPM then causes to flap between plugged and unplugged states. Disabling ASPM in the BIOS is not sufficient to fix this. The workaround is to add pcie_ports=compat to the list of kernel boot parameters.
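Once the system is installed, it is worth confirming that the parameter actually made it onto the running kernel's command line. The following is a minimal sketch; has_kernel_param is a made-up helper, and the grubby command it suggests is the usual way to add a parameter persistently on EL-style distributions (other distributions use different tooling).

```shell
#!/bin/sh
# Succeed if kernel command line $1 contains parameter $2 as a whole word.
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# /proc/cmdline holds the command line of the running kernel.
cmdline=$(cat /proc/cmdline 2>/dev/null)

if has_kernel_param "$cmdline" "pcie_ports=compat"; then
    echo "pcie_ports=compat is active"
else
    echo "pcie_ports=compat is missing; on EL-style systems it can be added with:"
    echo "  grubby --update-kernel=ALL --args=pcie_ports=compat"
fi
```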

Intel HD Audio Line Mapping

This took me a while to work out, and had me thinking I had a failed audio port. The front panel connector is wired to an unusual port, so it produces no output and does not even emit ACPI events when something is connected or disconnected. The solution is to produce a correct map and supply it to the driver (it turns out problems like this are so common that the snd-hda-intel driver can load such a map at startup).

Simply put this in /lib/firmware/hda-jack-retask.fw:

[codec]
0x10ec0889 0x00000000 2

[pincfg]
0x11 0x411111f0
0x12 0x59a3112e
0x14 0x01014c10
0x15 0x01011c12
0x16 0x01016c11
0x17 0x01012c14
0x18 0x01a19c40
0x19 0x02a19c50
0x1a 0x01813c4f
0x1b 0x0321403f
0x1c 0x411111f0
0x1d 0x4015e601
0x1e 0x01441130
0x1f 0x01c46160

And put this in /etc/modprobe.d/hda-jack-retask.conf

options snd-hda-intel patch=hda-jack-retask.fw,hda-jack-retask.fw,hda-jack-retask.fw,hda-jack-retask.fw

That should solve the problem.

Final Words

Unfortunately, it took many man-days over the past two years to work all this out, and to find the solutions. It is not acceptable that a high-end flagship product of the sort the SR-2 was presented to be is so buggy and requires so much troubleshooting from the end customer. The SR-2 has its place in history as the board that allowed overclocking of Xeons, alongside gems from long ago such as the A-bit BP6, which allowed dual-socket operation with Celerons. But in the time it took to work around all of its bugs, it has become deprecated, discontinued and unsupported, and top-of-the-line Xeon X5690 processors now sell for little enough on the second-hand market that the gains from overclocking no longer justify the effort, as they appeared to 2-3 years ago when starting with the several times cheaper X5650 processors.

In retrospect, when the effort is accounted for, a similar build using a pair of X5690 Xeons and a Supermicro X8DTH-6F motherboard would have almost certainly been a cheaper and less problematic experience. It might not have any overclocking functionality, but while offering the same number of PCIe x16 slots (7) and memory sockets (12), it does support 192GB of RAM (4x more than the SR-2 in the same number of sockets) without any special undocumented approaches required to make it work, and comes with an 8-port SAS controller on-board, while suffering from none of the problems above. Something that just works is usually much more economical than something that ends up requiring many days of troubleshooting effort.

Virtually Gaming, Part 2: Evolution – Consolidation and Move to KVM

In the previous article in this series, I detailed the journey to my original configuration with a single host providing multiple gaming capable virtual machines as a multi-seat workstation. But things have changed since then – many game distribution platforms such as Steam, GOG and Desura have native Linux versions, and many games have been ported to run natively on Linux. The vast majority of the ones that haven’t now work perfectly under WINE.

Consequently, the ideal solution has changed as well. In the original configuration, there were 3 seats on the system – two Windows VMs for gaming and one Linux VM for more serious use. At least one of the Windows VMs could now be removed, and its functionality replaced with WINE and native ports.

At the same time, KVM has advanced greatly in features and stability, and is now much better aligned with the requirements of this multi-seat workstation project. Perhaps most importantly, the latest QEMU provides a much better workaround for the issue I had to patch Xen’s hvmloader for: max-ram-below-4g (an option to the -machine parameter). Setting this to 1GB comprehensively works around the IOMMU compatibility bug of the Nvidia NF200 PCIe bridges on the EVGA SR-2, without any negative side effects.

Even better, KVM also includes patches that neuter the Nvidia driver’s ability to detect it is running in the VM (add kvm=off to the list of options passed to the -cpu parameter). That means that modifying the GPU firmware or hardware to make it appear as a Quadro or Tesla card is no longer required for using it in a virtual machine. This is a massive advantage over the original Xen solution for most people.

Summary of the most significant changes:

  • Host system updated to EL7 (CentOS)
    Required to facilitate easier running of more recent kernels and Steam (no more need to build and update an additional package set to support Steam as on EL6, including glibc). On the downside – this necessitates putting up with systemd.
  • Xen replaced by KVM
  • Windows 7 VM now uses UEFI instead of legacy BIOS
    This does away with all of the legacy VGA complications such as VGA arbitration. The UEFI OVMF firmware even downloads and executes the PCI devices’ BIOS during the VM’s POST, which results in the full splash screen and even the UEFI BIOS configuration menus being available on the external console during VM boot.
  • XP x64 VM removed
    Superseded by using native Linux game ports and WINE for the rest (so far every XP compatible game I have tried works)

Some of the extra repositories I used for this are:

OVMF UEFI and SeaBIOS Firmware repository from here: https://www.kraxel.org/repos/

Mainline kernel from elrepo repository: http://elrepo.org/tiki/tiki-index.php

Bleeding edge QEMU (needed for the max-ram-below-4g option).

The full libvirt xml configuration file I use for QEMU is here:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
<name>edi</name>
<uuid>11111111-1111-1111-1111-111111111111</uuid>
<memory unit='KiB'>16777216</memory>
<currentMemory unit='KiB'>16777216</currentMemory>
<vcpu placement='static'>4</vcpu>
<sysinfo type='smbios'>
<bios>
<entry name='vendor'>GENERIC</entry>
<entry name='version'>GENERIC</entry>
<entry name='date'>01/01/2014</entry>
<entry name='release'>0.91</entry>
</bios>
<system>
<entry name='manufacturer'>GENERIC</entry>
<entry name='product'>GENERIC</entry>
<entry name='version'>GENERIC</entry>
<entry name='serial'>1</entry>
<entry name='uuid'>11111111-1111-1111-1111-111111111111</entry>
<entry name='sku'>GENERIC</entry>
<entry name='family'>GENERIC</entry>
</system>
</sysinfo>
<os>
<type arch='x86_64' machine='pc-i440fx-2.2'>hvm</type>
<boot dev='hd'/>
<smbios mode='sysinfo'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
<cpu>
<topology sockets='1' cores='4' threads='1'/>
</cpu>
<clock offset='localtime'/>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type='block' device='cdrom'>
<driver name='qemu' type='raw'/>
<target dev='hdc' bus='ide'/>
<readonly/>
<address type='drive' controller='0' bus='1' target='0' unit='0'/>
</disk>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' io='native'/>
<source dev='/dev/zvol/normandy/edi'/>
<target dev='vda' bus='virtio'/>
<serial>1</serial>
<address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</disk>
<controller type='usb' index='0'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
</controller>
<controller type='pci' index='0' model='pci-root'/>
<controller type='ide' index='0'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
</controller>
<controller type='sata' index='0'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</controller>
<interface type='bridge'>
<mac address='52:54:00:11:22:33'/>
<source bridge='br0'/>
<model type='virtio'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
<hostdev mode='subsystem' type='pci' managed='no'>
<source>
<address domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='no'>
<source>
<address domain='0x0000' bus='0x07' slot='0x00' function='0x1'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='no'>
<source>
<address domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</hostdev>
<memballoon model='virtio'>
<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</memballoon>
</devices>
<qemu:commandline>
<qemu:arg value='-drive'/>
<qemu:arg value='if=pflash,format=raw,readonly,file=/usr/share/edk2.git/ovmf-x64/OVMF-pure-efi.fd'/>
<qemu:arg value='-cpu'/>
<qemu:arg value='host,kvm=off'/>
<qemu:arg value='-machine'/>
<qemu:arg value='pc-i440fx-2.2,max-ram-below-4g=1G,accel=kvm,usb=off'/>
</qemu:commandline>
</domain>

The reason for the qemu:commandline section is that libvirt and especially virt-manager do not actually understand all possible QEMU parameters. The ones that they don’t support directly are in this section to avoid errors and complaints from virsh and virt-manager in normal use.

You may also notice that there are some unusual sections and values in there, so let me touch upon them in groups.

Windows Activation and Associated Checks

When you first activate Windows with a key, it keeps track of several important details of the hardware in order to detect whether the same installation has been moved into another machine. Most licenses (e.g. OEM ones) are not transferable to another machine. So in order to ensure that our installation is portable (e.g. if we upgrade to a different hypervisor at a later date), we set the various values to something static, easily memorable and predictable, so that if we ever need to migrate the VM to another host, it will not cause deactivation issues. The important settings are here (these are not in all cases complete sections, only the fragments required for this purpose, see above for the full configuration):

<uuid>11111111-1111-1111-1111-111111111111</uuid>
<sysinfo type='smbios'>
  <bios>
    <entry name='vendor'>GENERIC</entry>
    <entry name='version'>GENERIC</entry>
    <entry name='date'>01/01/2014</entry>
    <entry name='release'>0.91</entry>
  </bios>
  <system>
    <entry name='manufacturer'>GENERIC</entry>
    <entry name='product'>GENERIC</entry>
    <entry name='version'>GENERIC</entry>
    <entry name='serial'>1</entry>
    <entry name='uuid'>11111111-1111-1111-1111-111111111111</entry>
    <entry name='sku'>GENERIC</entry>
    <entry name='family'>GENERIC</entry>
  </system>
</sysinfo>
<os>
  <smbios mode='sysinfo'/>
</os>
<devices>
  <disk type='block' device='disk'>
    <serial>1</serial>
  </disk>
</devices>

Nvidia Bugs/Features Workarounds

The following sections are required in order to work around the NF200 PCIe bridge bugs (max-ram-below-4g=1G) and the Nvidia driver feature that disables GeForce GPUs in virtual machines (kvm=off):

<qemu:commandline>
  <qemu:arg value='-cpu'/>
  <qemu:arg value='host,kvm=off'/>
  <qemu:arg value='-machine'/>
  <qemu:arg value='pc-i440fx-2.2,max-ram-below-4g=1G,accel=kvm,usb=off'/>
</qemu:commandline>

CPU Configuration

<cpu>
  <topology sockets='1' cores='4' threads='1'/>
</cpu>

The reason this is important is because most non-server editions of Windows only allow up to two CPU sockets. By default QEMU presents each CPU core as being on a separate socket. That means that no matter how many CPUs you pass to your Windows VM, while they will all show up in Device Manager, only a maximum of two will be used (you can verify this using Task Manager). What the above configuration block does is instruct libvirt to tell QEMU to present four cores in a single CPU socket, so that all are usable in the Windows VM.

VFIO and Kernel Drivers

In my system I have two identical Nvidia GPUs. Numerically, the second one is primary (host), and the first one is the one I am passing to a virtual machine. I am also passing the NEC USB 3.0 controller to the VM. This is the script I wrote (in /etc/sysconfig/modules/) to bind the devices intended for the VM to the VFIO driver:

#!/bin/bash
# PCI addresses (bus:slot.function) of the first GPU and its HDMI audio function
nvidia1=$(lspci | grep "GTX 780 Ti" | head -1 | awk '{print $1;}')
hda1=$(echo $nvidia1 | sed -e 's/\.0$/.1/')
# Same for the second GPU
nvidia2=$(lspci | grep "GTX 780 Ti" | tail -1 | awk '{print $1;}')
hda2=$(echo $nvidia2 | sed -e 's/\.0$/.1/')
# NEC USB 3.0 controller
nec=$(lspci | grep "NEC" | awk '{print $1;}')
# The second GPU and its audio function stay with the host drivers
echo nvidia        > /sys/bus/pci/devices/0000:$nvidia2/driver_override
echo snd-hda-intel > /sys/bus/pci/devices/0000:$hda2/driver_override
# The first GPU, its audio function and the USB controller go to vfio-pci
echo vfio-pci      > /sys/bus/pci/devices/0000:$nvidia1/driver_override
echo vfio-pci      > /sys/bus/pci/devices/0000:$hda1/driver_override
echo vfio-pci      > /sys/bus/pci/devices/0000:$nec/driver_override
modprobe vfio-pci
# Register the vendor/device IDs with vfio-pci
echo 10de 1284     > /sys/bus/pci/drivers/vfio-pci/new_id
echo 10de 0e0f     > /sys/bus/pci/drivers/vfio-pci/new_id
echo 1033 0194     > /sys/bus/pci/drivers/vfio-pci/new_id
# Detach the VM devices from whichever driver currently holds them
echo 0000:$nvidia1 > /sys/bus/pci/devices/0000:$nvidia1/driver/unbind
echo 0000:$hda1    > /sys/bus/pci/devices/0000:$hda1/driver/unbind
echo 0000:$nec     > /sys/bus/pci/devices/0000:$nec/driver/unbind
# Bind them to vfio-pci
echo 0000:$nvidia1 > /sys/bus/pci/drivers/vfio-pci/bind
echo 0000:$hda1    > /sys/bus/pci/drivers/vfio-pci/bind
echo 0000:$nec     > /sys/bus/pci/drivers/vfio-pci/bind
# Finally load the host GPU driver
modprobe nvidia

Note that the PCI bus IDs will change if you add more hardware to the machine – that is why I wrote this script, rather than assigning the devices statically by ID. The above script works for me on my hardware – you will almost certainly need to modify it for your configuration, but it should at least give you a reasonable idea of the approach that works.

Important: The devices this identifies have to match what is in your libvirt XML config file in the relevant hostdev sections. You will have to adjust that manually for your configuration, either using virsh edit or virt-manager.
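To make that manual adjustment a little less error-prone, something along these lines can turn an address as printed by lspci into the matching libvirt source address element. This is only a sketch: hostdev_address is a made-up helper, and it assumes PCI domain 0000, as in the configuration above.

```shell
#!/bin/sh
# Convert an lspci-style address (bus:slot.function, e.g. 07:00.0) into the
# <address> element used inside a libvirt <hostdev><source> section.
# Assumes PCI domain 0000.
hostdev_address() {
    addr=$1
    bus=${addr%%:*}     # text before the colon
    rest=${addr#*:}     # slot.function
    slot=${rest%%.*}    # text before the dot
    func=${rest#*.}     # text after the dot
    printf "<address domain='0x0000' bus='0x%s' slot='0x%s' function='0x%s'/>\n" \
        "$bus" "$slot" "$func"
}

hostdev_address 07:00.0
hostdev_address 07:00.1
hostdev_address 0d:00.0
```

The three example addresses are the ones that appear in the hostdev sections of my configuration; feed it the values your own lspci reports instead.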

Also depending on your hardware, you may need to do the initial Windows installation on the emulated GPU rather than the real one (e.g. if you are using a USB controller for the VM that requires additional drivers, as is the case with the USB 3.0 controller I am using for my VM). Otherwise you will get display output but be unable to use your keyboard/mouse during the installation.

Gaming on Linux: Steam

A pre-packaged Steam binary used to be available from the rpmfusion repository, but it no longer appears to be there. Thankfully, there is also negativo17’s maintained repository of Steam for Fedora 20+, which installs and runs fine on EL7. You may also need to grab a few RPMs from Fedora 19 because EL7 doesn’t ship with a full complement of 32-bit libraries. The ones I found I needed are these:

libbsd-0.6.0-3.fc19.i686
libtxc_dxtn-1.0.0-3.fc19.i686
libxkbcommon-0.3.0-1.fc19.i686
openal-soft-1.16.0-2.fc19.i686
SDL2-2.0.3-1.fc19.i686
SDL2_image-2.0.0-4.fc19.i686

The reason these are from Fedora 19 is because F19 is virtually identical in terms of package versions to EL7.

Typically, the Steam RPM installation is a one-off, mostly to bootstrap the initial run, and install the dependencies. After that, a local version of Steam will be installed in the user’s home directory in ~/.local/share/Steam/. In light of the recent Steam bug resulting in deletion of the user’s entire home directory, I implemented a solution that runs Steam as a separate steam user, from that user’s own home directory. That way should anything similar to this ever happen, the only thing that would be deleted is the steam user’s home directory rather than any important files not related to running Steam games.

To do this, you will need to add a steam user, and give it necessary permissions:

$ sudo adduser steam
$ sudo usermod -a -G audio,games,pulse-access,video steam

Add the following to /etc/sudoers.d/steam:

%games ALL = (steam) NOPASSWD: /usr/bin/steam

Create the following script (e.g. /usr/local/bin/steam.sh):

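#!/bin/bash
# Allow the local steam user to draw on this X session
xhost +SI:localuser:steam
# Let members of the audio group reach this session's PulseAudio socket
chgrp audio /run/user/$UID /run/user/$UID/pulse
chmod 750 /run/user/$UID /run/user/$UID/pulse
# Run Steam as the steam user, then clean up its leftover dbus-launch
sudo -u steam /usr/bin/steam
sudo -u steam pkill dbus-launch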

From there on, when you invoke steam.sh, it will launch Steam as the steam user, and pass the graphical output to the Xorg session of the logged-in user. The net result is that any potentially damaging bug in Steam or associated games can only do damage to the files owned by the steam user. This is not dissimilar to the Android security model, where every application runs under its own user for similar reasons.

Gaming on Linux: WINE

There are two obvious options for this:

1) PlayOnLinux

2) More traditional WINE (I use the one from DarkPlayer’s repository)

I only had to make one configuration change to WINE, and that is to disable the dwrite.dll library in WINE (to disable it, run winecfg, go to Libraries -> add dwrite.dll, edit dwrite.dll entry and set it to disabled). I am using XP version emulation, which isn’t even supposed to include dwrite.dll, and the problem it causes is that fonts are invisible in Steam and some other applications.

End Result

The end result is a much cleaner virtual machine configuration: e.g. no missing RAM like before with Xen, due to the NF200 bug workaround, and no need for hardware modification of my GeForce cards. The performance seems very smooth, and so far the entire setup has been completely trouble free.

There is also one fewer virtual machine and one fewer GPU in the system without any loss of functionality. Should I require an additional seat in the future, it will most likely be a Linux one, and implemented using a Xorg multi-seat configuration.

Virtually Gaming, Part 1: In the Beginning – Hardware and Xen

For about two years now I have managed to stick to the “No Windows on bare metal.” policy. This was instated for many reasons, including security and ease of backups (it is difficult to beat ZFS snapshots and send/receive functionality). The key reason for using Windows at all has been gaming, and both my wife and I play various games, mostly of the co-op FPS genre. While native Linux support has increased dramatically in that time, the availability of native Linux games still hasn’t quite reached parity with availability on the Windows platform.

Combining the “No Windows on bare metal.” policy with the requirement for high performance gaming capability meant that the only solution that fits is PCI passthrough of a high end GPU to the virtual machine. In this article I will describe the journey to the solution over the past two years, including (often unfortunate) choices of hardware, software, working around hardware, firmware, driver and software bugs, crippling and limitations, and other bumps on the road to virtualized gaming.

Hardware

When I first embarked on this project, it was an off-shoot of the project to upgrade my workstation. While there was nothing wrong with my Quad Core 2 in terms of performance, I needed to get a second machine up and running for my wife. So, somewhat optimistically, I thought this would be an ideal opportunity to solve three problems at the same time:

  1. Get a gaming grade workstation up and running for my wife
  2. Virtualize the Windows part of my dual-boot setup so I never have to reboot for the sake of joining a game when my friends invite me
  3. Implement the “No Windows on bare metal.” policy

The motherboard that caught my eye was the EVGA SR-2. It seemed to fit all of the necessary requirements:

  1. Plenty of CPU power (dual socket, capable of taking up to two 6-core Xeons)
  2. Full support for plenty of ECC memory (after the last build I have vowed to never build another machine without ECC RAM, having spent days troubleshooting a stock-setting stability issue that turned out to be marginal memory)
  3. Plenty of PCIe slots (7 x16 slots, with 64 usable PCIe lanes between them)
  4. VT-d support (originally listed in the spec, and confirmed with EVGA tech support prior to purchase – a claim that turned out to be rather stretching the truth)

A sizable investment into the motherboard, a pair of 6-core X5650 Xeons, and 48GB (6x8GB) of registered ECC RAM later, problems began.

Hardware Problems

The first motherboard I got turned out to have a faulty PCIe slot #1. The retailer I bought the motherboard from went bust a few weeks after my purchase, but EVGA generally have excellent RMA service, and I registered the motherboard as soon as I had received it to qualify for the full 10 year manufacturer’s warranty that was offered on this motherboard.

In order to not put my build on hold, before I RMA-ed the faulty SR-2, I bought another, second-hand SR-2 on eBay. I thoroughly tested it, and to this day, this is the SR-2 that has been completely fault free in the main workstation that was the product of this project. It turns out I was quite lucky to have bought a second motherboard, because the replacement that was sent was also faulty, and failed to reliably finish POST-ing with either of my CPUs in either socket. That got RMA-ed as well, and the replacement is currently in use as a prototyping rig for the next incarnation of this workstation, but that motherboard also has problems which cause it to fail to boot on a hot reboot (I am putting off RMA-ing it until the prototyping stage of the project is completed and being without a working prototyping machine for a week won’t be a problem).

In conclusion: Beware EVGA warranty replacement motherboards – they are all refurbished items that were sent back as faulty, and either repaired or the fault was never reproducible by their testing team so they got recycled as is. Always test any refurbished replacements extremely thoroughly (all slots, sockets, ports and features) when you receive them – if you get a faulty replacement, EVGA will pay for the shipping costs back to them for another replacement, but only within the first month after you receive the replacement, so acting quickly and thoroughly is of vital importance to avoid courier costs that can quickly add up to a lot.

More RAM

At this time I looked into using 96GB of RAM on the SR-2. This turned out to be very difficult as the machine would generally refuse to POST, except after a fresh CMOS reset. This was particularly annoying because the CPUs themselves (which contain the MCH) officially support 192GB of RAM each. After a lot of trial and error, I found a way to make the machine reliably POST with 96GB of RAM:

  1. Use dual-ranked (this is important, single ranked won’t work for 96GB!) x4 registered 1600MHz 1.35V DIMMs
  2. Boot the machine with only 6 DIMMs. Go to the memory settings, and manually set all of the memory timings to what they defaulted to. Make sure you set the command rate to 2T (defaults to 1T).
  3. If you are overclocking, make sure you set the MCH strap to 1600MHz.

Do this and your SR-2 should POST with 96GB. It may require a few attempts where the motherboard re-sets itself and re-attempts the POST, but both of mine successfully POST within 30 seconds.

All of the symptoms indicate that there is a BIOS bug in timeouts at various stages of the POST that cause some initializations to fail and time out when more than 48GB of RAM is used. Officially, EVGA only claim the SR-2 supports up to 48GB of RAM, and it is unlikely they will be fixing this BIOS bug.

Hypervisor

Back when I began this project (late 2012), the only hypervisor with notable reports of GPU passthrough success without requiring a lot of manually applied experimental patches was Xen, so this was what I chose for the project. Additionally, my previous tests indicated that the performance overheads of using Xen were among the lowest of all the available hypervisors, so it seemed like a win-win situation.

The primary GPU in the machine was an ageing but perfectly adequate GeForce 8800GT that came from my previous workstation. Then I had to select a suitable GPU for passthrough to a virtual machine. Nvidia passthrough only worked on expensive Quadro (and not all Quadros, only the expensive ones), Tesla and Grid cards which they refer to as “MultiOS compatible”. The cost of most of those made them not an option worth considering. That meant trying an ATI card, so I got a cheap passively cooled single-slot Radeon HD6450. This is where a whole array of real problems began:

  1. The EVGA SR-2 motherboard uses a pair of NF200 PCIe bridges to multiplex the 32 PCIe lanes available on the upstream Intel 5520 PCIe hub into 64 PCIe lanes available for GPUs. NF200 bridges have severe bugs and limitations when it comes to compatibility with VT-d. They bypass the IOMMU for DMA transfers, so when the VM tries to access RAM within its virtual address space that overlaps the physical address of a PCI BAR (aperture) that belongs to a hardware device, the memory writes will hit the BAR, which will crash the machine (and maybe corrupt your disks, if the BAR being trampled belongs to a disk controller). The solution to this was to write a hvmloader patch that marked all of the IOMEM areas from the host as reserved. This was an ugly bodge that resulted in a fair amount of memory in the domU (what Xen calls guest VMs) becoming unusable, but it worked (and with enough RAM it wasn’t a major problem).
  2. More than likely related to point 1, this motherboard appears to have broken (or non-existent) support for interrupt remapping, which means that any devices passed to a VM have to have dedicated, unshared interrupts. If you pass a device sharing an interrupt to the VM, the VM will most likely crash the entire host. Problems 1 and 2 are very similar in symptoms (host crash), which made them quite difficult to troubleshoot and get to the bottom of because no one change to the configuration made the problem go away. It took some help from the Xen developers and a fair amount of guesswork to figure it all out. The only solution is to move cards around to different slots until all of the hardware you intend to pass through to virtual machines has dedicated interrupts that aren’t shared with other hardware. This can be fiddly, but it is generally achievable – in my final configuration, I am successfully passing two GPUs and three USB controllers to VMs.
  3. ATI cards suffer from terrible drivers that fail to re-initialize the card without full BIOS level re-POST-ing (and said re-POST-ing doesn’t happen when the VM is rebooted, only when the entire physical machine is rebooted). The consequence is that they work OK when the VM is first booted up after a host reboot, but subsequent VM reboots result in massive performance degradation, glitches, and sometimes complete host crashes. While some of this is being worked on (e.g. functionality to reset the GPU via a bus reset from Xen dom0), it is still not available in the current released version. This particular problem turned out not to be easily solvable (having already written a patch for Xen’s hvmloader, I was very keen to avoid having to write any more to implement PCI bus resetting functionality for the Xen pci-stub driver). To at least be able to prove the concept, I bought the cheapest Nvidia Quadro that is supported for GPU passthrough (Quadro 2000), and this worked absolutely fine. Having finally found a solution that works perfectly, I went on to find ways of making GeForce cards work with PCI passthrough by fooling the Nvidia driver into initializing them even though they weren’t expensive enough, which is done by modifying the card’s ID number into that of an equivalent Quadro card. As discussed in previous articles, Nvidia cards up to and including the Fermi generation can be modified into equivalent Quadro cards by changing the appropriate ID strap bits in the card’s BIOS using nvflash. Kepler cards require a small hardware modification. The easiest modifications are GTX 680 to Tesla K10 (remove one resistor) and GTX 780Ti to Quadro K6000 (add one large, easy to solder resistor across appropriate pins on the EEPROM). I am currently running a pair of GTX 780Ti cards.
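The slot shuffling described in point 2 is easier if you can see at a glance which devices share an interrupt. Below is a minimal sketch, assuming a Linux dom0 where each PCI device exposes its IRQ number in `/sys/bus/pci/devices/<address>/irq`; the function names are hypothetical, and treating "shares an IRQ with a device that stays on the host" as the conflict condition is my simplification of the problem rather than anything the Xen tools do.

```python
from collections import defaultdict
from pathlib import Path

def read_pci_irqs(sysfs_root="/sys/bus/pci/devices"):
    """Map each PCI address (e.g. '0000:07:00.0') to its IRQ number."""
    irqs = {}
    for dev in Path(sysfs_root).iterdir():
        irq_file = dev / "irq"
        if irq_file.is_file():
            irqs[dev.name] = int(irq_file.read_text().strip())
    return irqs

def passthrough_conflicts(passthrough, irqs):
    """Return {device: [host devices sharing its IRQ]} for passed devices.

    Functions of the same card that are all passed through together are
    not reported against each other; only devices left on the host are.
    """
    by_irq = defaultdict(list)
    for address, irq in irqs.items():
        if irq > 0:  # 0 means the device has no IRQ assigned
            by_irq[irq].append(address)
    passed = set(passthrough)
    conflicts = {}
    for address in passthrough:
        irq = irqs.get(address, 0)
        others = sorted(a for a in by_irq.get(irq, []) if a not in passed)
        if others:
            conflicts[address] = others
    return conflicts
```

Calling `passthrough_conflicts([...], read_pci_irqs())` with the addresses you intend to pass through should return an empty dictionary once every card is in a slot with a dedicated interrupt.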

Issues 1 and 2 listed above are why I said that claiming the SR-2 supports VT-d was seriously stretching the truth. On a well-designed workstation motherboard, the above problems should never have arisen. After all that, and many, many man-days invested in working around the various bugs mentioned above, I have Xen working on the system, with EL6 (CentOS) dom0, and two domUs, one running XP x64 and one running Windows 7 x64. The hardware passed through on PCIe level is:

XP x64:

  • Intel ICH10 HD Audio
  • 2x ICH10 USB
  • GeForce GTX 780Ti

Windows 7 x64:

  • NEC USB 3 controller
  • GeForce GTX 780Ti

GRUB options:

 kernel /xen.gz noreboot unrestricted_guest=1 msi=1
 module <kernel and options> intel_iommu=on pcie_ports=compat

Note that unrestricted_guest=1 and pcie_ports=compat are required on the SR-2, but may not be required if your hardware behaves better. If your IOMMU implementation is good and includes ACS functionality, you shouldn’t need unrestricted_guest=1.

pcie_ports=compat is required because without it the SR-2 makes the PCI hotplug driver flap very quickly on one of the PCI devices built into the south bridge chipset, which causes an interrupt flood that makes the machine grind to a halt. (Have I mentioned enough times yet that the SR-2 is extremely buggy?)
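Because a missing option here shows up as a hang rather than an error message, it can be worth confirming that dom0 really booted with both kernel options. This is a minimal sketch with a hypothetical helper name; it only covers the dom0 kernel line, since the options on the xen.gz line are not part of the dom0 kernel command line.

```python
def missing_kernel_options(cmdline,
                           required=("intel_iommu=on", "pcie_ports=compat")):
    """Return the required options absent from a kernel command line."""
    present = set(cmdline.split())
    return [option for option in required if option not in present]
```

On a running dom0 the string to check would be the contents of `/proc/cmdline`; an empty list means both options made it through GRUB.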

Xen domU config:

name="mydomu"
description="None"
uuid="a57e6840-e9f5-4a14-a822-abcdef012345"
memory=16384
maxmem=16384
vcpus=6
on_poweroff="destroy"
on_reboot="restart"
on_crash="destroy"
localtime=1
keymap="en-gb"

builder="hvm"
device_model_override="/usr/lib/xen/bin/qemu-dm"
device_model_version="qemu-xen-traditional"
boot="c"
disk=[ '/dev/zvol/ssd/mydomu,raw,hda,rw', '/dev/sr0,raw,hdc:cdrom,rw' ]
vif=[ 'mac=00:11:22:33:44:55,bridge=br0,model=e1000', ]
stdvga=1
usb=1
acpi=1
apic=1
pae=1
gfx_passthru=0
pci = [ '07:00.0', '07:00.1', '00:1b.0', '00:1a.1' ]
xen_platform_pci=1
pci_msitranslate=0
pci_power_mgmt=1

Obviously you will need to change things like PCI addresses, MAC addresses, block device paths, and suchlike to suit your own system.

/etc/modprobe.d/xen-pciback.conf:

options xen-pciback permissive=1 hide=(07:00.0)(07:00.1)(00:1b.0)(00:1a.1)

Note the PCI addresses in the xen-pciback module options correspond to the PCI addresses in the Xen domU configuration. You may not need permissive=1 if you have better hardware than I do.
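It is easy to let the two lists drift apart when moving cards between slots. The sketch below cross-checks them, assuming the domU config and the modprobe options line are available as strings in the formats shown above; the function names are made up for illustration, and the parsing is deliberately simple rather than a full parser for either file format.

```python
import re

def normalise(address):
    """Add the default PCI domain so '07:00.0' and '0000:07:00.0' match."""
    address = address.strip().lower()
    return address if address.count(":") == 2 else "0000:" + address

def hidden_devices(modprobe_line):
    """Extract the addresses from a 'hide=(..)(..)' module option."""
    match = re.search(r"hide=((?:\([^)]*\))+)", modprobe_line)
    if match is None:
        return set()
    return {normalise(a) for a in re.findall(r"\(([^)]*)\)", match.group(1))}

def passed_devices(domu_cfg):
    """Extract the addresses from the 'pci = [ ... ]' line of a domU config."""
    match = re.search(r"^pci\s*=\s*\[(.*?)\]", domu_cfg,
                      re.MULTILINE | re.DOTALL)
    if match is None:
        return set()
    entries = re.findall(r"['\"]([^'\"]+)['\"]", match.group(1))
    # Drop any per-device options that follow the address after a comma.
    return {normalise(entry.split(",")[0]) for entry in entries}

def not_hidden(domu_cfg, modprobe_line):
    """Return devices the domU expects that xen-pciback does not hide."""
    return sorted(passed_devices(domu_cfg) - hidden_devices(modprobe_line))
```

An empty result from `not_hidden` means every device listed in the domU config is also claimed by xen-pciback at boot.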

And Another Thing

One thing I feel I have to mention is that I have had extremely bad experiences with every SAS card I have tried to use in the SR-2 with virtualization. This includes two different LSI cards, an Adaptec card and a 3ware card. They all work fine in a normal bare metal setup, but cause all kinds of crash-inducing problems, some more difficult to debug than others, when IOMMU is enabled and VMs are running with PCI devices passed through to them. SATA cards (I tried Silicon Image and Marvell), on the other hand, seem to always work just fine, with no problems whatsoever, including when using 1:5 SATA port multipliers. In some cases the SAS problems are caused by the SAS controller being a native PCI-X chip and using a phantom PCI-X to PCIe bridge. In other cases they seem to be caused by the SAS card’s driver trying to do some interesting DMA accesses that crash the entire host when virtual machines are running with PCI devices passed through to them. In short – avoid using SAS cards and stick with SATA – but then again I find that to be good advice to follow regardless.

This setup has worked without any significant problems for the past two years. But things have changed in that time. There is now a native Linux version of Steam, and many games have native Linux ports. It is time that this long term reliable system is updated accordingly. More on that in the next article.