Hardware Accelerated SSL on ARM – Redux

A long time ago, I posted an article about the advantages of hardware-accelerated SSL encryption, and how to get it working on Fedora Linux. Since then, some things have improved, and some things have regressed.

Improvements:

Regressions:

  • Red Hat have broken OpenSSH with their audit patch. This is particularly inconsistent given that the distro-supplied openssh package in EL6 is built with the --with-ssl-engine option, to enable support for hardware crypto acceleration, yet this is clearly completely untested, which raises the question of what the point of it is.

Thankfully, the regression mentioned above can be fixed to make sshd work properly with hardware crypto offload.

Here are links to patched OpenSSL and OpenSSH packages for EL6, current at the time of writing this article:

http://ftp.redsleeve.org/pub/el6-staging/packages/soc/SRPMS/openssl-1.0.1e-30.el6.11.cryptodev.src.rpm

http://ftp.redsleeve.org/pub/el6-staging/packages/soc/SRPMS/openssh-5.3p1-104.el6.1.cryptodev.src.rpm
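
These are source RPMs, so they need rebuilding into binary packages before they can be installed; assuming rpm-build and the usual build dependencies are present, roughly:

rpmbuild --rebuild openssl-1.0.1e-30.el6.11.cryptodev.src.rpm
rpmbuild --rebuild openssh-5.3p1-104.el6.1.cryptodev.src.rpm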

While ssh using the Blowfish cipher in software is very fast and good enough for general-purpose ssh usage, for some operations, such as transferring ZFS snapshots over ssh, hardware-offloaded AES provides a very welcome performance boost, because it leaves more CPU available for other processes.
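
As a hypothetical example (pool, snapshot and host names are made up), forcing an AES cipher for a snapshot transfer looks something like this, with the patched packages above handling the offload on both ends:

zfs send tank/data@nightly | ssh -c aes128-cbc backuphost zfs receive backup/data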

ZFS-FUSE 0.7.1 Released

The last official release of zfs-fuse was years ago, and it was seriously starting to fall behind other implementations. It was effectively abandoned, which is quite inconvenient considering it is still the only viable option on 32-bit Linux installations (e.g. on ARM, or for those still tied to i686 for legacy reasons).

Since I use Linux on ARM heavily, I have been working on changing this for the past few weeks. The last official release, 0.7.0, was made by Seth Heeren a few years ago, and supported ZFS pool versions up to v23. Emmanuel Anne was maintaining an unofficial post-0.7.0 branch that added support for pool versions up to v26. Over the past couple of years, other people have contributed a few patches here and there (manual ashift setting at boot time, some patches to add support for ARM, a couple of patches maintained out of tree and shipped with the Fedora package). Over the past few weeks I needed a few additional features that already existed in other implementations, particularly for running a root file system on it (mount.zfs for legacy mount points, and better systemd/initramfs support), so I added those. It also transpired that a few of the patches that made it into the official 0.7.0 release weren't in Emmanuel's code tree, since it was forked before the official 0.7.0 release; I located and backported those from Seth's maint branch on github.
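
As a hypothetical illustration of what the legacy mount point support enables (the dataset name here is made up), a root dataset can be switched to legacy mounting and then handled via /etc/fstab like any other file system:

zfs set mountpoint=legacy tank/rootfs

and then in /etc/fstab:

tank/rootfs  /  zfs  defaults  0  0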

With all this done, and with no other volunteers showing any interest in further maintaining zfs-fuse, it seems to have fallen to me to make the decision to cut the 0.7.1 release. I have tested it extensively on my ARM systems with pools of various sizes (16GB to 16TB) and complexities (single disk to RAIDZ2), and it has been very stable.

If you are stuck on a 32-bit Linux platform and would love the features of ZFS, you can find the latest release of zfs-fuse on github:

https://github.com/gordan-bobic/zfs-fuse

Future work will include adding support for additional pool versions. I have already created branches for those, but this will need extensive testing before I deem it stable enough for a release. If you are interested in helping with either development or testing of zfs-fuse, please do get in touch.

EVGA SR-2 – Long Term Review

Having used the EVGA’s once flagship and possibly their most hyped up ever motherboard for the past two and a half years and having fought it’s many bugs and quirks extensively over that period through many uses it was supposed to, in theory, be capable of but was clearly never tested against, it seemed like a good idea to collate all the issues and workarounds into a single article. These findings have been cross-checked against multiple SR-2 motherboards.

Hardware / BIOS / POST

While there are various minor annoying bugs in the BIOS itself, I will not go into the details of those, and will instead focus on the issues that matter in real practical use.

96GB of RAM

The Xeon X5xxx series CPU specification states that each CPU is capable of addressing 192GB of RAM. Unfortunately, the EVGA SR-2 specification only claims support for up to 48GB. This is more than a little disappointing, but there is a way to persuade the board to complete the POST with 96GB. You will need twelve 8GB x4 dual-ranked registered DDR3 DIMMs. Insert six of them into the red memory slots, boot up, and set the following:

  • MCH strap: 1600MHz
  • Memory speed: 1333
  • Manually set all the memory timings to what they were auto-detected to be
  • Set the command rate to 2T
  • No voltage increases are required just because you have 96GB – if your DIMMs are rated at 1.35V, then there is no need to set DIMM voltages higher than 1.35V.

Power down, insert the remaining six DIMMs, and it should now be able to boot with 96GB. The POST may take 2-3 cycles to complete, but within 30 seconds or so you should see the BIOS splash screen. Once it has booted up, a soft reboot will complete without delay; only a cold boot takes a little while.
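
Once the OS is up, it is worth sanity-checking that all twelve DIMMs have been detected and the full 96GB is visible, e.g. under Linux:

free -g
dmidecode --type memory | grep Size

The second command should show twelve populated 8GB modules.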

Don’t expect 96GB to POST at much over 167MHz BCLK.

Unfortunately, more than 96GB will not work.

Watch out for SpeedStep Side Effects

If you enable SpeedStep but disable TurboBoost, the CPU will still boost to +1 multiplier. This is not intuitive and can cause you problems during stability testing.

Clock Generator Stability

Above 180MHz BCLK, expect to see very noisy clock signals. If you watch the clock speeds in a monitoring application, you will notice that they regularly spike very high and very low. This means that operation above 180MHz BCLK is not going to be stable enough for any serious use.

Virtualization With VT-d / IOMMU

All the PCIe slots on the SR-2 are behind Nvidia NF200 PCIe bridges. Unfortunately, these have a bug in that they do not route all DMA via the upstream PCIe root hub. The consequence is that when a virtual machine with PCI passthrough tries to access a physical address range within its virtual sandbox that overlaps the physical range of a PCI IOMEM area mapped to any physical device, the access will be routed to the physical device rather than remapped out of the way. When this happens, at best it will result in a host crash when a physical card crashes and takes the PCIe bus down with it. At worst, the memory access will trample the region mapped to a disk controller, which can easily result in garbage being written to disk – and then the host will crash anyway.

The workaround is to make sure, by whatever means are available, that the virtual machine does not access the area between 1GB and 4GB, which is the area reserved for mapping PCI I/O memory. Two years ago, the only solution available to me was to write a patch for Xen's hvmloader that marked that entire memory area as reserved. In theory you could also tell your guest OS to simply not use that memory (e.g. using bcdedit in Windows 7 and later to mark the area as badmem, or using mem= parameters to the Linux kernel). Today, with the latest versions of QEMU for Xen and KVM, you can instead pass the max-ram-below-4g=1G parameter to the -machine option, which achieves the same thing much more cleanly and with no ill side effects (such as 3GB of RAM going missing in the guest).
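
For example, when invoking QEMU directly under KVM, the relevant fragment of the command line looks something like this (the machine type and accelerator shown are just one reasonable combination; toolstacks such as libvirt or Xen may expose the same property differently):

-machine pc,accel=kvm,max-ram-below-4g=1G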

Note that even with this workaround, there will still be weird, seemingly DMA-related crashes on the SR-2 when you have VT-d enabled and you use SAS controllers. For some reason this motherboard really does not play well with them (I tried three different generations of LSI controllers, an Adaptec, and a 3Ware). Some controllers will simply have no disks show up when you boot the kernel with intel_iommu=on (older LSI, Adaptec); others will seem to work but randomly crash when a VM with PCI passthrough is running (3Ware). Simple SATA controllers do not seem to suffer from this problem.

Marvell 88SE9123 SATA-3 6 GBit controller

This may nominally be a 6GBit/s SATA controller, but you should be aware that its physical upstream connection is a single x1 PCIe 2.0 lane, with a maximum throughput of 5GBit/s. That means the maximum throughput you can possibly get from both of these SATA ports (the red ones on the board) combined is about 450-500MB/s. This is something to bear in mind if you are planning to connect a pair of SSDs. You will achieve higher overall throughput by connecting the 2nd SSD to the ICH10 SATA-2 controller (the black ports on the board), even though the latter only supports up to 3GBit/s.
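
For reference, the arithmetic behind that figure:

5 GT/s x 8/10 (8b/10b encoding) = 4 GBit/s = 500 MB/s of raw link bandwidth
500 MB/s minus PCIe packet and protocol overhead ≈ 450 MB/s usable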

Overclocking with Westmere Xeons

The settings I have used with great success for the past 2.5 years, in addition to those mentioned above for operation with 96GB of RAM, are:

  • CPU Core Voltage: 1.300V. This is sufficient for up to 4GHz. You may need to go as far as 1.350V for 4.15GHz, but beyond that no voltage increase will keep things stable.
  • VTT Voltage: 1.325V. This is sufficient up to about 3.33GHz uncore speeds, which is about as far as you can realistically expect to get out of Westmere Xeons. Do not under any circumstances push this past 1.350V as it is almost guaranteed to damage the CPU regardless of how good the cooling is.
  • BCLK: <= 180MHz. My experience is that this is as far as you can go before clock frequencies start to spike all over the place. In the interest of stability, I would recommend not exceeding 177MHz, as this is where the 4.8GT/s QPI setting actually reaches the 6.4GT/s that all the components are rated at (see the arithmetic below) – and there seems to be almost no headroom at all for QPI overclocking on components of this generation.
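
For reference, the 177MHz figure comes from simple scaling: the QPI link speed scales with BCLK from its nominal 133MHz base, so the 4.8GT/s setting ends up running at 4.8 x 177 / 133 ≈ 6.4GT/s, i.e. right at the rated speed of the components.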

Motherboard Heatsink Fan

As far as I have been able to establish, this fan only seems to make any appreciable difference in cases of combined extreme BCLK overclocking, IOH over-volting, and using most if not all of the 64 PCIe lanes available through the PCIe slots. In more typical use (two PCIe x16 GPUs, 166MHz BCLK, a relatively low 1.250V on the IOH), the difference between the fan running flat out (approx. 5000 rpm) and completely off is around 9C (46C fully on, 55C completely off). Consequently, it may be preferable in some cases to remove the aluminium duct plate surrounding the fan, disconnect the fan, and leave the heatsink to passively cool the Intel 5520 I/O Hub, Intel ICH10 South Bridge, and Nvidia NF200 PCIe bridges. The airflow provided by the case fans is likely to be more than sufficient in most if not all installations. This also prevents the sometimes extreme yet invisible dust build-up in the heatsink fins under the duct plate, which can cause temperatures to end up higher than they would be with no fan or duct plate present at all.

Linux

Hot-plug Flapping

This will show up as soon as you start the installer of any distribution you choose: a flood of messages to the console that makes the system grind to a halt. The cause is an on-board device that is erroneously marked as hot-pluggable, which results in ASPM making it flap between plugged and unplugged states; disabling ASPM in the BIOS is not sufficient to fix this. The workaround is to add pcie_ports=compat to the list of kernel boot parameters.
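
The same parameter can be typed at the installer's boot prompt, and for the installed system it needs appending to the kernel command line in the boot loader configuration. On an EL6-style grub setup, for example, the kernel line ends up looking something like this (kernel version and root device are placeholders):

kernel /vmlinuz-<version> ro root=<root-device> pcie_ports=compat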

Intel HD Audio Line Mapping

This took me a while to work out, and had me thinking I had a failed audio port. The front panel connector is wired to an unusual port, resulting in it producing no output and not even emitting ACPI jack events when something is connected or disconnected. The solution is to produce a correct pin map and supply it to the driver (it turns out problems like this are so common that the snd-hda-intel driver can load such a map at startup).

Simply put this in /lib/firmware/hda-jack-retask.fw:

[codec]
0x10ec0889 0x00000000 2

[pincfg]
0x11 0x411111f0
0x12 0x59a3112e
0x14 0x01014c10
0x15 0x01011c12
0x16 0x01016c11
0x17 0x01012c14
0x18 0x01a19c40
0x19 0x02a19c50
0x1a 0x01813c4f
0x1b 0x0321403f
0x1c 0x411111f0
0x1d 0x4015e601
0x1e 0x01441130
0x1f 0x01c46160

And put this in /etc/modprobe.d/hda-jack-retask.conf:

options snd-hda-intel patch=hda-jack-retask.fw,hda-jack-retask.fw,hda-jack-retask.fw,hda-jack-retask.fw
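
The map is only read when the driver loads, so after creating both files either reboot or, with nothing using the sound device, reload the module:

modprobe -r snd-hda-intel && modprobe snd-hda-intel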

That should solve the problem.

Final Words

Unfortunately, it took many man-days over the past two years to work all of this out and find the solutions. It is not acceptable that a high-end flagship product of the sort the SR-2 was presented to be is so buggy and requires so much troubleshooting from the end customer. While the SR-2 has its place in history as the board that allowed Xeons to be overclocked, alongside gems from long ago such as the Abit BP6 which allowed dual socket operation with Celerons, in the time it took to work around all of its bugs it has unfortunately already become deprecated, discontinued, and unsupported, and top-of-the-line Xeon X5690 processors are now selling for little enough on the second-hand market that the gains simply do not justify the effort in the way they appeared to 2-3 years ago, when starting with the several times cheaper X5650 processors.

In retrospect, when the effort is accounted for, a similar build using a pair of X5690 Xeons and a Supermicro X8DTH-6F motherboard would almost certainly have been a cheaper and less problematic experience. It has no overclocking functionality, but it offers the same number of PCIe x16 slots (7) and memory sockets (12), supports 192GB of RAM (4x more than the SR-2 in the same number of sockets) without any special undocumented approaches required to make it work, comes with an 8-port SAS controller on-board, and suffers from none of the problems above. Something that just works is usually much more economical than something that ends up requiring many days of troubleshooting effort.