Shared Root Single System Image Clustering

When running multiple servers that are supposed to be nearly identical (e.g. a cluster where all nodes have the same functionality), managing the configuration can be a daunting task if not approached in the right way. In this article we will explore the tools at our disposal to make this easier and the approaches we can take to ensure that our solution has the best available compromise between performance and maintainability.

Shared Root

An excellent way to implicitly keep the configuration and installed packages of multiple servers identical is to configure them to share the root file system. This ensures that any configuration change on one of the nodes is implicitly and immediately applied to all the other nodes. Since there is no scope for the configuration and packages to get out of sync, the complexities of maintenance are greatly reduced. It effectively simplifies the task from administering n nodes down to administering 1 node.

Open Shared Root

Open Shared Root (OSR) is the de facto standard for implementing shared root clusters on Linux. It provides all the required tools for creating an initrd that contains everything needed to start and mount the shared file system that will be used for root. A bootstrap like this is required if the shared root file system is anything other than NFS. The Linux kernel supports NFS booting natively, so in that case the additional bootstrap that OSR provides isn’t strictly necessary, but it is still useful even then.

File systems that can be used for shared root fall broadly into two categories:

Cluster file systems
Examples of cluster file systems include:
GFS and GFS2 (from RedHat)
OCFS and OCFS2 (from Oracle)
VMFS (VMware)
VxCFS (Veritas/Symantec)

These file systems are characterized by the fact that they exist directly on top of a block device. In the context of shared root these are typically provided by iSCSI, ATAoE, Fibre Channel or DRBD, but directly shared SCSI buses are also sometimes used.

Network file systems
Examples of network file systems include:
NFS
CIFS (a.k.a. SMB or Samba)
GlusterFS

These file systems are characterized by the fact that they export an already existing, underlying file system. The underlying file system is typically one that exists directly on a block device, such as the cluster file systems mentioned above or any of the many non-cluster local file systems (e.g. ext3, NTFS, …).

When a cluster file system is used for the shared root, there is another important consideration – not only does the file system itself have to be supported by the bootstrapping process, but the underlying block device has to be supported, too.

OSR supports a number of combinations of the file systems and block devices mentioned above (see the OSR OS platform support matrix for the most up-to-date information). At the time of writing of this article, support for GlusterFS based OSR isn’t listed in the matrix, but support does exist for it – the author of this article knows this for a fact since he developed and contributed the patch and the corresponding documentation to the OSR project :-).

The OSR website has plenty of excellent documentation and howtos on how to configure it for most common scenarios, so there seems little need to repeat them here.

Pitfalls

While OSR has a lot going for it, it isn’t without its own share of potential pitfalls. It is great when it works (and it is mature enough that most of the time it does just work). One of the ongoing aspects of its development is the minimization of the initrd bootstrap footprint. However, the fact that it has to start up all the clustering components opens a sizable scope for things to break. The author of this article has more than once seen an update to key clustering software components, even on enterprise distributions, leave those components broken and render the entire cluster unbootable. This is particularly problematic on OSR because the entire system becomes unbootable with relatively limited ability to troubleshoot the problem, since the entire root file system is unavailable. For just such cases it would be very useful to have a more fully featured bootstrap root that would allow for much more graceful troubleshooting and recovery.

Another complication introduced by OSR is that the startup and shutdown sequences have to be adjusted to cooperate gracefully with the fact that they are running inside the pre-initialisation bootstrap. Specifically, this means being careful which processes need to be excluded during shutdown’s killall5 execution, and which file systems should be left mounted during the first stage of the shutdown sequence (the rootfs needs to be left to the bootstrap root shutdown sequence to unmount). OSR comes with patches to the init scripts to take care of this, but in some cases (such as with GlusterFS), the process is not as straightforward and foolproof as one might hope. When this process hits a problem, the shutdown sequence hangs.

All this got me thinking about a similar approach but with key differences to address the above concerns.

Virtualized Shared Root

To summarize, the two features that this approach was designed to add compared to the vanilla OSR method are:

  1. Availability of a fully featured bootstrap environment.
  2. Removal of the need to pay special attention to the startup and shutdown sequences due to the peculiarities of running on a shared root.

The obvious way to implement 1) is to ignore the OSR bootstrap and simply use a normal (albeit relatively minimal) OS install to prepare and bootstrap the volumes for the shared root instance. This works reasonably well, but it does bring a problem with it – the bootstrap OS isn’t implicitly identical between the nodes. In OSR this is addressed by the fact that the same initrd is used on all the nodes, so even though the bootstrap OS isn’t permanently shared, a high degree of consistency exists due to the bootstrap being initialized from the same image at every boot. So for the sake of tidiness and feature equivalence with OSR, some method must be applied to ensure that the copies are kept in sync. The tool used to achieve this is csync2.

csync2 is similar to rsync, but is specifically designed for synchronizing a set of files to a large number of remote nodes. I am not going to go into the details of csync2 setup here because good documentation exists on the Linbit website and elsewhere. The csync2 configuration file I use is provided below because it shows which files should be excluded from the synchronization.

group openvz-osr
{
host openvz-osr1;
host (openvz-osr2);
key /etc/csync2/openvz-osr.key;

include /*;
exclude /dev;
exclude /etc/adjtime;
exclude /etc/blkid;
exclude /etc/csync2/csync2_ssl_*;
exclude /etc/mtab;
exclude /etc/glusterfs;
exclude /etc/sysconfig/hwconf;
exclude /etc/sysconfig/network;
exclude /etc/sysconfig/network-scripts/ifcfg-eth0;

exclude /etc/sysconfig/networking;
exclude /etc/sysconfig/vz-scripts;
exclude /gluster;
exclude /proc;
exclude /sys;

exclude /tmp;
exclude /usr/libexec/hal-*;
exclude /usr/libexec/hald-*;
exclude /var/cache;
exclude /var/csync2/backup;
exclude /var/ftp;
exclude /var/lib/csync2;
exclude /var/lib/nfs/rpc_pipefs;
exclude /var/lib/openais;
exclude /var/lock;
exclude /var/log;
exclude /var/run;
exclude /var/spool;
exclude /var/tmp;

exclude /vz;

include /vz/template;

backup-directory /var/csync2/backup;
backup-generations 3;

auto none;
}

The main thing to pay attention to here is that some files need to be host specific, rather than shared/mirrored (this is, BTW, also the case with OSR). Specifically, these include things like the csync2 host keys and the network configuration settings (the two nodes still have different names and IP addresses even if they are supposed to be identical in all other ways). As a bare minimum, on any shared root system the host-specific entries in the above config (the csync2 SSL keys and the network configuration files under /etc/sysconfig) should be kept unshared. Virtual file systems such as /dev, /proc and /sys are node-specific and unshareable in any case. The rest will depend on the exact nature and purpose of the system.

The first csync2 run typically takes a few minutes, and subsequent syncs typically take a few seconds. If run as a daily cron job (or manually after any software or configuration update), this will ensure that nodes’ bootstrap OS is kept in sync.
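
For example, a cron entry along these lines will keep the copies converged (the csync2 path and the schedule are illustrative; csync2 -x performs a full check-and-sync cycle of everything listed in the configuration):

# /etc/cron.d/csync2-osr, purely a sketch, adjust path and schedule to taste
30 4 * * * root /usr/sbin/csync2 -x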

The way 2) is achieved is by using OpenVZ OS-level (container) virtualization. What originally got me thinking about taking this approach is that OSR effectively fires up the shared root init chrooted to the shared volume it has brought up. This is conceptually very similar to FreeBSD’s Jails and Solaris’ Zones, and the Linux equivalent of those is OpenVZ. It provides very thin virtualization of the process ID space (in some cases init not having a PID of 1 can cause problems) and of the networking stack (so that each VM can have independent networking). Just like Jails and Zones, OpenVZ doesn’t use a disk image – instead the VM’s files live as ordinary files in the directory path where the OpenVZ chroot exists (usually /vz/private/). This makes it particularly convenient to use for shared root – all that is required is that the shared file system is mounted in /vz/private.
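
As a rough sketch of what this looks like in practice (the GlusterFS volume file path and the container ID 101 are illustrative, and the exact mount syntax depends on the GlusterFS version in use), bringing the shared root up on a node amounts to something like:

# mount the shared file system where OpenVZ expects the container's private area
mount -t glusterfs /etc/glusterfs/glusterfs.vol /vz/private/101

# start the shared root container
vzctl start 101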

This approach delivers in full on the original goal of making the startup and shutdown processes more robust and avoiding the need for init script patches. (Note: for cleanliness a few lines of rc.sysinit could do with commenting out because some features and /proc paths aren’t applicable to OpenVZ chroots, but this is purely to avoid errors being reported during startup.) Additionally, due to the shared root node being virtualized, it is possible to reboot the shared root node without rebooting the entire server. This is in itself quite a useful feature. As with OSR and csync2 approaches, some files and directories should be unshared (see the red list above).

Disk and File System Optimization

While RAID arrays and flash disks have become much more common over recent years, much of the old advice on extracting the best performance out of them seems to have fallen out of common use. In this article I will try to cover the basic file system optimisations that every Linux system administrator should know and apply regularly.

Performance from the ground up

The default RAID block size offered by most controllers and by Linux software RAID, 64-256KB, is way too big for normal use. It will kill the performance of small IO operations without yielding a significant increase in performance for large IOs, and it can sometimes even hurt large IO operations, too.

To pick the optimum RAID block size, we have to consider the capability of the disks.

Multi-sector transfers

Modern disks can handle transfers of multiple sectors in a single operation, thus significantly reducing the overheads caused by the latencies of the bus. You can find the multi-sector transfer capability of a disk using hdparm. Here is an example:

# hdparm -i /dev/hda

/dev/hda:

 Model=WDC WD400BB-75FRA0, FwRev=77.07W77, SerialNo=WD-WMAJF1111111
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=74
 BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=78125000
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3
ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6

 * signifies the current active mode

The thing we are interested in here is the following:

MaxMultSect=16, MultSect=16

I have not seen any physical disks recently that have this figure at anything other than 16, and 16 sectors * 512 bytes/sector = 8KB. Thus, 8KB is a reasonable choice for the RAID block (a.k.a. chunk in software RAID) size for traditional mechanical disks.
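
With Linux software RAID the chunk size is set when the array is created. As an illustrative sketch (the device names and RAID level are examples only; mdadm’s --chunk value is in KiB):

mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=8 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1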

There are a few exceptions to these figures – some modern disks, such as the very recent Western Digital ones, support sector sizes of 4KB, so it is important to check the details and make sure you know what you’re dealing with. Also, some virtualization platforms provide virtual disk emulation that supports transfers of as many as 128 sectors. However, in the case of virtual disk images, especially sparse ones, it is impossible to make any reasonable guesstimates about the underlying physical hardware, so none of this applies in a meaningful way.

Flash memory

For flash based solid state disks, however, there are a few additional things to consider. Flash disks can only be erased in relatively large blocks, typically between 128KiB and 512KiB. Size of writes can have a massive influence on performance because in the extreme case, a single sector write of 512 bytes still ends up resulting in a whole 512KiB block being erased and re-written. Worse, if we are not careful about the way we align our partitions, we could easily end up with file system blocks (typically 4KB) that span multiple physical flash blocks, which means that for all writes to those spanning blocks we would end up having to write two flash blocks. This is bad for both performance and longevity of the disk.

Disk Geometry – Mechanical Disks

There is an additional complication in that alignment of the virtual disk geometry also plays a major role with flash memory.

On a mechanical disk the geometry is variable due to the inherently required logical translation that compensates for the varying number of sectors per cylinder (inner cylinders have a smaller circumference and thus fewer sectors than the outer cylinders). This means that any optimization we may try to do to ensure that superblock and extent beginnings never span cylinder boundaries (and thus avoid a track-to-track seek overhead, which is nowadays, fortunately, very low) is relatively meaningless, because the next cylinder shrink could throw out our calculation. While this used to be a worthwhile optimisation 25 years ago, it is, sadly, no longer the case.

There is a useful side-effect of this translation that one should be aware of. Since outer cylinders have more sectors and the rpm is constant, it follows that the beginning of the disk is faster than the end. Thus, the most performance critical partitions (e.g. swap) should be physically at the front of the disk. The difference in throughput between the beginning and the end of the disk can be as much as two-fold, so this is quite important!

Disk Geometry – Flash

Flash disks require no such translation so careful geometry alignment is both useful and worthwhile. To take advantage of it, we first have to look at the erase block size of the flash disk in use. If we are lucky, the manufacturer will have provided the erase block size in the documentation. Most, however, don’t seem to. In the absence of definitive documentation, we can try to guesstimate this by doing some benchmarking. The theory is simple – we disable hardware disk caching (hdparm -W0) and test the speed of unbuffered writes to the disk using:

dd if=/dev/zero of=/dev/[hs]d[a-z] oflag=direct bs=[8192|16384|32768|65536|131072|262144|524288]

What we should be able to observe is that the performance will increase nearly linearly up to erase block size (typically, but not always, 128KiB), and then go flat.
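
A quick sketch of such a probing run is shown below. Note that it is destructive (it overwrites the start of the disk), the device name and the count are illustrative, and dd prints its throughput summary on stderr:

hdparm -W0 /dev/sdX
for bs in 8192 16384 32768 65536 131072 262144 524288; do
    echo "bs=$bs"
    dd if=/dev/zero of=/dev/sdX oflag=direct bs=$bs count=2048 2>&1 | tail -n 1
done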

Once we have this, we need to partition the disk with the geometry such that cylinders always start at the beginning of an erase block. Since these will always be powers of 2, the default CHS geometry with 255 heads and 63 sectors per track is pretty much the worst that can be chosen. If we set it for 128 heads and 32 sectors per track, however, things become much more sane for aligning cylinders to erase block boundaries. This yields 2MB cylinders which should work well for just about all flash disks. Thus, we can run fdisk by explicitly telling it the geometry:

fdisk -H 128 -S 32 /dev/[hs]d[a-z]

One important thing to note is that the first partition (physically) on the disk doesn’t start at sector 0. This is a hangover from DOS days, but if we used the first cylinder as is, we would end up messing up the alignment of our first partition. So, what we can do instead is make a partition spanning only the 1st cylinder and simply not use it. We waste a bit of space, but that is hardly a big deal. Alternatively, we could put the /boot partition at the beginning of the disk, as that changes very infrequently and is never accessed after booting.

Next we have to look at what options are available for the file system we intend to use. The ext2/3/4 file systems provide several parameters that are worth looking at.

Stride

The stride parameter is used to adjust the file system layout so that the data and metadata for each block are placed on different disks. This improves performance because the operations can be parallelized.

This is specifically related to RAID – on a single disk we cannot distribute this load, and there is more to be gained by keeping the data and metadata in adjacent blocks to avoid seek times and make better use of read-ahead.

The stride parameter should be set so that file system block size (usually 4KB) * stride = RAID chunk size. In this case the block size is 4KB and the RAID chunk is 8KB, so stride = 2.

mkfs.ext4 -E stride=2

Stripe Width

This is a setting that both RAID arrays and flash media can benefit from. It aims to arrange blocks so that writes cover a whole stripe at once, rather than suffering the double hit of the read-modify-write operation that parity RAID levels (RAID 3, 4, 5, 6) incur.
This benefit is also directly applicable to flash media because on flash we have to write an entire erase block, so cramming more useful data-writes into that single operation has a positive effect both in terms of performance and disk longevity. If the erase block size (or the stripe width for RAID) is, for example, 128KiB, we should set stripe-width = 128KiB / 4KiB = 32:

mkfs.ext4 -E stripe-width=32

Block Groups

So far so good, but we’re not done yet. Next we need to consider the extent / block group size. The beginning of each block group contains a superblock for that group. It is the top of that inode subtree, and needs to be checked to find any file/block in that group. That means the beginning block of a block group is a major hot-spot for I/O, as it has to be accessed for every I/O operation on that group. This, in turn, means that for anything like reasonable performance we need to have the block group beginnings distributed evenly across all the disks in the RAID array, or else one disk will end up doing most of the work while the others are sitting idle.

For example, the default for ext2/3/4 is 32768 blocks in a block group. The adjustment can only be made in increments of 8 blocks (32KB assuming 4KB blocks). Other file systems may have different granularity.

The optimum number of blocks in a group will depend on the RAID level and the number of disks in the array, but you can simplify it into a RAID0 equivalent for the purpose of this exercise e.g. 8 disks in RAID6 can be considered to be 6 disks in RAID0. Ideally you want the block group size to align to the stripe width +/- 1 stride width so that the block group beginnings rotate among the disks (upward for +1 stride, downward for -1 stride, both will achieve the same effect).

The stripe width in the case described is 8KB * 6 disks = 48KB. So, for optimal performance, the block group should align to a multiple of 8KB * 7 disks = 56KB. Be careful here – in the example case we need a number that is a multiple of 56KB, but not a multiple of 48KB because if they line up, we haven’t achieved anything and are back where we started!

56KB is 14 4KB blocks. Without getting involved in a major factoring exercise, 28,000 blocks sounds good (default is 32768 for ext3, which is in a reasonable ball park). 28,000*4KB is a multiple of 56KB but not 48KB, so it looks like a reasonable choice in this example.

Obviously, you’ll have to work out the optimal numbers for your specific RAID configuration, the above example is for:
disk multi-sector = 16
ext3 block size = 4KB
RAID chunk size = 8KB
ext3 stripe-width = 12
ext3 stride = 2
RAID = 6
disks = 8

mkfs.ext4 -g 28000
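
Putting the whole RAID example together into a single invocation (the device name is illustrative, and the journal size option is covered in the next section):

mkfs.ext4 -b 4096 -E stride=2,stripe-width=12 -g 28000 /dev/md0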

In case a flash disk is being used, the default value of 32768 is fine since this results in block groups that are 128MB in size. 128MB is a clean multiple of all likely erase block sizes, so no adjustment is necessary.

Journal Size

Journal size can also be adjusted to optimize array performance. Ideally, the journal should be sized to fill a multiple of the stripe size. In the example above, this means a multiple of 48KB. The default is 128MB, which doesn’t quite fit, but 126MB (for example) does. The -J size= option takes the journal size in megabytes:

mkfs.ext4 -J size=126

Since flash disks typically have very fast reads and access times, it is possible to not use journalling at all. Some crash-proofing will be lost, but fsck will typically complete very quickly on an SSD, thus minimizing the need for a journal in environments that don’t require the extra degree of crash-proof data consistency. If journalling is not required, simply use the ext2 file system instead:

mkfs.ext2

or disable the journal:

mkfs.ext4 -O ^has_journal

Growing

If you are certain that the file system will never need to be grown, you can adjust the amount of space reserved for the growth of the block group descriptor table. Unfortunately, the growth limit has to be a few percent bigger than the current file system size, but this is still better than the default of 1000x bigger or 16TB, whichever is smaller. This will also free up some space for data.

mkfs.ext4 -E resize=6T

Crippling Abstraction

The sort of finesse explained above, which can be applied to extract better (and sometimes _massively_ better) performance from disks, is one of the key reasons why LVM (Logical Volume Management) should be avoided where possible. It abstracts things away and encourages a lack of forward thinking. Adding a new volume is equivalent to stretching a software RAID array by adding a new disk. It will upset the block group size calculation and disrupt the load balancing across all the disks in the array that we have just carefully established. By doing this you can cripple the performance of some operations from scaling linearly with the number of disks down to being bogged down to the performance of just one disk.

This can make a massive difference to the IOPS figures you get out of a storage system. There is scope for offsetting this, but it reduces the flexibility somewhat. You could carve up the storage into identical logical volumes, each of which is carefully aligned to the underlying physical storage, and add logical volumes in an appropriate quantity (rather than just one at a time) so that the block groups and journal size still align in an optimal way.

Choice of Compilers – Part 2: SPARC

In the previous article in this series we looked at the performance of different compilers on Intel’s x86/x86-64 architecture. In this article we will look at the performance of the two available compilers on Sun’s SPARC Niagara platform.

Only two compilers were used in this part of the test, and versions of both of these were included in the previous roundup. They are:

  1. GNU GCC (4.2.3)
  2. Sun’s SunCC (12.0, part of Sun Studio development toolkit)

The server used for testing was a Sun T2000 with a 1.0GHz SPARC T1 processor with 8 cores and 32 threads. The operating system was Solaris 10.

The test application fits curves to arbitrary sampled data. It has 4 innermost loops for fitting 4 curve parameters and it implements its own partial caching using a technique described in a previous article on this site. The code implements most of the optimizations also mentioned in the said article, in order to improve vectorization and the speed of the application.

More information on the application used for testing is available in the previous article that covers compiler performance on the x86 platform.

Compiler Switches

Selecting the correct compiler switches is important for achieving the best performance from your code. They can also break things and produce code that yields results that are quite clearly wrong. The switches used in the tests were the ones found to produce the fastest code without producing results that are wildly inaccurate (some numeric instability is acceptable, as long as it is limited).

The compiler switches used are separated into two parts for each compiler: the common part and the machine specific part. The common part is the same for the given compiler on all platforms, and the machine specific part varies according to the processor. The switches used are heavily based on what the compiler developers deemed a good fully optimized combination (based on the compiler’s -fast shortcut switch or equivalent references in the documentation). Some options, however, have been added/removed/changed due to producing poor results, either in accuracy or speed.

GCC

Common -O3 -fno-strict-aliasing -ffast-math -foptimize-register-move -frerun-loop-opt -fexpensive-optimizations -fprefetch-loop-arrays -fomit-frame-pointer -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=5 -fpic -Wall
SPARC T1 -mcpu=niagara -mtune=niagara

SunCC:

Common -xO5 -xalias_level=simple -fns=yes -fsimple=2 -xbuiltin=%all -xdepend=yes -xlibmil -xlibmopt -xunroll=4 -xprefetch=auto -xprefetch_level=1 -xipo=2 -xldscope=global -Yl,/usr/bin -Qpath /usr/bin -features=extensions
SPARC T1 -xarch=v9 -xchip=ultraT1 -xtarget=native -m64 -xcache=native

Results

Here are the normalized scores, relative to GCC (GCC = 100%, more is better).

GCC 91.79 i/s 100.00%
SunCC 11.74 i/s 12.79% (with -xvector=lib)
SunCC 92.98 i/s 101.30% (without -xvector)

Analysis

Two things became evident during the test. At first we tried using the -xvector=lib parameter on SunCC in order to gain any available advantage from the vectorization features of the CPU. This, however, produced results that were an order of magnitude worse than GCC’s. It wasn’t immediately clear why SunCC’s performance was that much lower, but it was quickly pinned down to the -xvector compiler flag. Removing it put the performance of the produced code back on track, and it came out 1.3% ahead of GCC. Presumably, the -xvector=lib flag causing a massive slowdown in the compiled code is down to a compiler or library bug.

The difference here is similar to the difference between GCC and SunCC on the x86 platform, only on SPARC T1 the results are in favour of SunCC.

Unlike on the Intel platform where using Intel’s compiler produced a massive performance boost, on Sun’s platform the advantage of using Sun’s compiler is fairly minimal.

The other thing the vigilant among the readers may have noticed is that the performance of Sun’s SPARC T1 processor with the best compiler available is approximately 67x (a staggering 6,700%) slower than the long obsolete Pentium III used in the x86 compiler roundup (93 i/s T1/SunCC vs. 6231 i/s P3/ICC). In addition, those familiar with the T1 processor will know that even though the T1 has 8 cores, it only has one FPU shared between them. This means that unlike on the multi-core Core2 CPUs, where performance in this test would scale nearly linearly with multiple parallel processes (only single thread tests were used in both this and the previous article), on the T1 the advantage of multiple cores/threads would be minimal. But then again, Sun openly admit that the T1 is not a processor to use for heavy floating point operations, which is largely what this test does.

While this poses some interesting questions about the performance of the T1, the purpose of this article was to analyze performance differences between different compilers on the same platform, rather than relative performance differences between different platforms. A separate article focusing on performance differences between the x86 Core2 and SPARC T1 platforms is already in the making to cover this aspect.

Choice of Compilers – Part 1: x86

In a previous article about writing efficient code, I briefly touched upon the subject of compilers. In this article we will look at the major C++ compilers available for Linux, compare their performance and cost, and highlight any pitfalls encountered.

Platforms

For this review, the following platforms will be used.

Servers

Athlon XP 1.53 GHz RedHat 9
Pentium III 1.49 GHz RedHat 9
Pentium 4 Celeron 1.70 GHz RedHat 9
Core2 x86-64 1.89 GHz CentOS 5

Compilers

GNU GCC 4.2.1
PGI PGCC 7.0.7
Intel ICC 9.1.051
Pathscale PathCC 3.0, CentOS 5 only
Sun’s SunCC 12.0, CentOS 5 only, glibc too old on RedHat 9

Note:

  1. A full x86-64 software stack is used on the Core2 system.
  2. Due to library version requirements, SunCC is only tested on the Core 2 system.
  3. Since only one trial licence was available for PathCC, it was only tested on the Core 2 platform. Because this is an x86-64 platform, the static binaries produced didn’t work properly on IA32 platforms due to library dependencies.
  4. All of the above compilers were current, up to date, state of the art versions at the time of writing this article.

Software

Benchmarks are largely meaningless. The only worthwhile benchmark is your specific application. If you are reading articles on this site, the chances are that you are a developer, and that you already know this, but I still feel I have to stress it. The results here are for my application. It was chosen because that is the heavy number crunching application I have worked on recently, and not for reasons of testing any specific set of features on any of the hardware or software mentioned.

Now that the boring disclaimer is over, the application in question fits curves to arbitrary sampled data. It has 4 innermost loops for fitting 4 curve parameters and it implements its own partial caching using a technique described in a previous article on this site. The code implements most of the optimizations also mentioned in the said article, in order to improve vectorization and the speed of the application.

The loops for curve fitting act as a good test for vectorization capabilities of compilers, and the operations performed in those loops are a combination of trigonometric functions and multiplication / addition. The caching implementation also throws in a pointer chasing element. All the iterators are unsigned integers and all the curve parameters and sampled data are single precision floats.

The nature of the algorithm for curve fitting in the test program is iterative. Different compilers with different optimization parameters suffer from different rounding errors in floating point operations. These rounding errors cause minor differences in convergence, and affect the number of iterations required to converge the solution. This may give one compiler an unfair (and coincidental) advantage over another, so instead of total time taken to converge, the results will be given in terms of iterations/second. Each iteration does the same amount of work, so this is a more reliable performance metric.

Compiler Switches

Selecting the correct compiler switches is important for achieving the best performance from your code. They can also break things and produce code that yields results that are quite clearly wrong. The switches used in the tests were the ones found to produce the fastest code without producing results that are wildly inaccurate (some numeric instability is acceptable, as long as it is limited).

The compiler switches used are separated into two parts for each compiler: the common part and the machine specific part. The common part is the same for the given compiler on all platforms, and the machine specific part varies according to the processor. The switches used are heavily based on what the compiler developers deemed a good fully optimized combination (based on the compiler’s -fast shortcut switch or equivalent references in the documentation). Some options, however, have been added/removed/changed due to producing poor results, either in accuracy or speed.

GCC

Note:
On Core 2, the nocona target architecture is used. This is due to there being no Core 2 specific optimization in this version of the compiler, and Nocona is the latest supported Intel CPU.

Common -O3 -fno-strict-aliasing -ffast-math -foptimize-register-move -frerun-loop-opt -fexpensive-optimizations -fprefetch-loop-arrays -fomit-frame-pointer -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=5 -fpic -Wall
Athlon XP -march=athlon-xp -mtune=athlon-xp -mmmx -msse  -m3dnow -mfpmath=sse -m32 -malign-double
Pentium III -march=pentium3  -mtune=pentium3  -mmmx -msse          -mfpmath=sse -m32 -malign-double
Pentium 4 -march=pentium4  -mtune=pentium4  -mmmx -msse2         -mfpmath=sse -m32 -malign-double
Core 2 -march=nocona    -mtune=nocona    -mmmx -msse3         -mfpmath=sse -m64

PGCC
Common -O4 -Mfprelaxed -Msingle -Mfcon -Msafeptr -Mcache_align -Mflushz -Munroll=c:1 -Mnoframe -Mlre -Mipa=align,arg,const,f90ptr,shape,libc,globals,localarg,ptr,pure -Minfo=all -Mneginfo=all -fpic
Athlon XP -tp=athlonxp -Mvect=sse
Pentium III -tp=piii -Mvect=sse
Pentium 4 -tp=piv -Mvect=sse -Mscalarsse
Core 2 -tp=core2-64 -Mvect=sse -Mscalarsse

ICC

Common -O3 -ansi-alias -fp-model fast=2 -rcd -align -Zp16 -ipo -fomit-frame-pointer -funroll-loops -fpic -w1 -vec-report3
Athlon XP -msse  -xK -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -cxxlib-icc
Pentium III -msse  -xK -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -cxxlib-icc
Pentium 4 -msse2 -xW -march=pentium4 -mcpu=pentium4 -mtune=pentium4 -cxxlib-icc
Core 2 -msse3 -xP

Note:

  1. The Athlon XP is using code targeted for the Pentium III because ICC doesn’t specifically support AMD processors. However, the Athlon XP and Pentium III share the same instruction set, so it’s close enough.
  2. ICC 9.1.051 doesn’t support Core 2 specific optimizations. The latest it seems to support is Core 1. ICC 10.x supports Core 2, but at the time of writing of this article it was not yet deemed to be of production quality, so the older, stable version was used.

PathCC:

Common -O3 -fno-strict-aliasing -finline-functions -ffast-math -ffast-stdlib -funsafe-math-optimizations -fstrength-reduce -align128 -fomit-frame-pointer -funroll-loops -fpic -gnu4 -Wall -LNO:vintr_verbose=ON
Core2 -march=core -mcpu=core -mtune=core -mmmx -msse3 -m64 -mcmodel=small

SunCC

Common -xO5 -xalias_level=simple -fns=yes -fsimple=2 -nofstore -xbuiltin=%all -xdepend=yes -xlibmil -xlibmopt -xunroll=4 -xprefetch=auto -xprefetch_level=1 -xregs=frameptr -xipo=2 -xldscope=global -Yl,/usr/bin -Qpath /usr/bin -features=extensions
Core 2 -xarch=sse3 -xchip=opteron -xtarget=opteron -xvector=simd -m64 -xcache=native -xmodel=small

Note:

  1. -Yl,/usr/bin and -Qpath /usr/bin are work-arounds for problems with the SunCC linker that makes it fail to link to some dynamic libraries. Instead the system ld is used when these parameters are specified, which may adversely affect performance – but it was the only option available to get the test completed.
  2. RedHat 9 did not have sufficiently up to date libraries to use the compiler, and attempts to produce static binaries for testing failed. This is why this compiler was only tested on CentOS 5.
  3. Opteron target architecture was used because no Intel x86-64 targets were documented. Opteron target worked fine, though.

Results

Let’s churn out some numbers, shall we? The percentage given is the normalized score, relative to GCC (GCC = 100%, more is better).

Athlon XP GCC 1973 i/s 100%
Athlon XP PGCC 1653 i/s 84%
Athlon XP ICC 5192 i/s 263%
Pentium III GCC 1812 i/s 100%
Pentium III PGCC 1590 i/s 88%
Pentium III ICC 6231 i/s 344%
Pentium 4 GCC 1169 i/s 100%
Pentium 4 PGCC 1130 i/s 97%
Pentium 4 ICC 8437 i/s 722%
Core 2 GCC:     3270 i/s 100%
Core 2 PGCC:    2715 i/s 83%
Core 2 ICC:    16814 i/s 514%
Core 2 PathCC:  3600 i/s 110%
Core 2 SunCC:   3212 i/s 98%

Analysis

A quick note about PGCC’s performance – the test code uses some C math library functions (mainly sinf()). These are not implemented in PGCC in a vectorizable form, and I am told that this is why its performance suffers. There are reassurances that this is being actively worked on and that a fix will be available shortly, but it was not going to be available before my trial licence expires. Having said that, GCC doesn’t currently have sinf() in vectorizable form either, and it still came out ahead. Another thing worth pointing out is that PGI seem to focus more on building distributed applications on clusters. This review covers no such application, so PGCC may still be a good choice for your application – it just isn’t a good choice for my application.

So, what do the results seem to say? Looking at the directly comparable results, PGCC’s performance is consistently about 12-17% behind GCC. It almost catches up on the Pentium 4, being only 3% behind on that platform. Both GCC and PGCC seem to suffer when moving to a faster Pentium 4. Their performance drops by around 30% despite an increase in clock speed of 13%. This is a rather unimpressive result. ICC’s performance leaves GCC’s and PGCC’s in the realm of the embarrassing. Not only does it outperform GCC and PGCC by between 2.63x-7.22x and 3.14x-7.47x respectively, but its performance increases on the Pentium 4 instead of decreasing as it does with GCC and PGCC. And not only does it increase, it even increases by 20% relative to the clock speed, despite the Pentium 4 Celeron having a much smaller Level 2 cache than the Pentium III in this test (128KB vs. 512KB). This is a truly impressive result. Not only does it show that ICC is a very good compiler, but it also shows that a lot of the perception of the Pentium 4’s poor performance comes from poorly performing compilers, rather than from badly designed hardware.

On the Core 2, PGCC appears to close the gap from being 7.5x slower on the Pentium 4 down to a marginally less embarrassing result of being only 6.2x slower than ICC. GCC remains safely ahead of PGCC in performance, but its lag behind ICC has increased (5.14x slower) compared to the Athlon XP and Pentium III results (2.63x and 3.44x respectively).

Sun’s compiler seems to be somewhat lacking in performance. Its results are roughly on par with GCC’s. Pathscale’s compiler, however, manages to beat GCC and take 2nd place behind ICC. This is quite a respectable achievement considering that parts of the PathCC compiler suite are based on GCC.

Looking at the logged compiler remarks, it is clear that ICC is gaining a massive advantage from knowing how to vectorize loops. Since the code in question uses only single precision floating point numbers, full vectorization performance is achieved on all of the tested processors. Just about all inner loops in the code vectorize. Even though Portland explain that PGCC currently lacks vectorizable versions of more complex math functions (e.g. sinf()), the compiler also failed to vectorize much simpler operations in loops, like addition and multiplication. Instead it chose to unroll the loops. The response from PGI support is that unrolling loops with small iteration counts is deemed faster than vectorizing them. Clearly, this seems to be wrong, since both GCC and ICC vectorized these simple loops, and outperformed PGCC.

Diagnostic output about loop vectorization did not seem to be available from SunCC, as the only similar switches seemed to be related to OpenMP parallelization which was not used in this test.

For those that don’t know what vectorization is all about – you may have heard of SIMD: Single Instruction Multiple Data. The idea is that instead of processing scalars (single values) you can process whole vectors (i.e. arrays) of scalars simultaneously, if the operation you are performing on them is the same. So, instead of processing 4 32-bit floats one at a time, you pack them into a vector (if you have them in an array, or they are appropriately aligned in memory) and process the vector in parallel. x86 processors have had this capability since the Pentium MMX for integer operations and since the Pentium III for floating point operations. Provided that the compiler knows how to convert loop operations on arrays into vector operations, the speed-up can be massive.
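
As a trivial illustration (this is not code from the test application), a loop like the one below is exactly the kind of thing a vectorizing compiler turns into packed SSE instructions, processing four floats per operation:

void scale(float *dst, const float *src, float k, unsigned int n)
{
	unsigned int i;

	/* with vectorization this becomes a packed multiply, 4 floats at a time */
	for (i = 0; i < n; i++)
		dst[i] = src[i] * k;
}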

Costs

Not all of the compilers reviewed here are free. Some are completely free (GCC, SunCC), some have free components due to being based on GPL-ed code (PathCC), some are free for non-commercial/non-academic use (ICC), and some are fully commercial (PGCC).

For the non-free compilers, the costs listed on the vendor’s web sites are:

PGCC $419
ICC $449 (free for non-commercial/non-academic use)
PathCC $795

Other Issues

A few other issues have been encountered with the compilers during testing. They were resolved, but I still felt they should be mentioned.

The PGCC licencing engine proved to be problematic and got in the way of using the compiler even when the instructions were followed correctly. The problem turned out to be caused by a broken component on the PGI servers for generating the trial licence keys. This led to a day or so of frustration, but was resolved in the end. While I understand that companies have to look after their intellectual property, Intel’s licencing engine worked much more simply and without any problems. Whereas PGI require a key to be generated on their web site and downloaded to a local file with the correct permissions, Intel simply provide a hex string key that can be pasted in at install time when setting up ICC. Intel’s method worked flawlessly. GCC and SunCC don’t have any licence control mechanisms, so this type of issue doesn’t apply to them.

SunCC required -features=extensions to correctly compile the test application, which took a bit of figuring out and a query on the support forum.

Intel, Sun and PGI all provide an online forum where any support issues can be raised and advice sought. In all cases support has been quite fast and at least informative, even if the issues couldn’t be resolved. The nature of the forums allows help to be provided both by the support staff and more experienced end users. For GCC there is extensive community support and mailing lists.

Conclusions

Choice of compiler can make a huge difference to the performance of your code. If speed is a consideration in your code (and in some applications it may not be), then the chances are that you cannot afford to ignore the benefit that can be gained by switching to a different compiler. On the x86 platform, ICC is completely unchallenged for the title of producing the fastest code. It is also free for non-commercial use. Another thing worth mentioning is that ICC also produces the fastest code on AMD’s x86 processors.

On the most current x86-64 platform tested (Core 2), ICC produces code that runs in the region of 5x faster than the competing compilers. If it is a cost based choice between buying 5x faster (or 5x more) hardware or buying a better compiler, upgrading the compiler wins by an order of magnitude.

Alignment of Structs – Chaotic Evil

Don’t be tripped up by data alignment. Sometimes in C you want to play with unions and structs or just cast to a char * and stream data, but unless you consider the architecture and the alignment it is almost guaranteed to bite you (editor’s comment: or is that byte you?).

Most (read: all I have tested) compilers align data based on its type, so: on x86, doubles will typically be 64-bit (sometimes 32-bit) aligned, ints 32-bit aligned, shorts 16-bit aligned, chars 8-bit aligned, etc.

So consider these 2 structs:

struct Foo
{
	int a;
	int b;
	char d;
	char e;
} A;

struct Bar
{
	char e;
	int b;
	char d;
	int a;
} B;

Do a sizeof(A) and sizeof(B) and you will find they yield very different results (I got 12 and 16).

Unfortunately this varies from compiler to compiler and from architecture to architecture, so alignment, sizes and padding will vary. For the example Foo, on x86-32 Windows with Borland Builder 5:

a is 4-byte word aligned
b follows straight after a, so it is aligned by default
d follows straight after b, so it is aligned by default, although it doesn’t need to be
e is byte aligned and directly follows d

2 bytes are left at the end, as the structure requires more than 2 words, can’t take 2.5 words, and so gets 3 4-byte words

In the example Bar:

e is 4-byte word aligned as it is at the start of the structure
b, even though e is only 1 byte, is 4-byte aligned for efficient memory access, so 3 spare bytes are left between them
d follows straight after b, so it is aligned by default, although it doesn’t need to be
a, even though d is only 1 byte, is 4-byte aligned for efficient memory access, so 3 spare bytes are left between them

We can write our code in an intelligent way bearing this in mind, to ensure that we don’t waste unnecessary memory and that our data is contiguous (without gaps between members), by putting the largest word aligned types first and then smaller types down to the smallest at the end.

By doing this we end up with a data structure without gaps (except at the end), like Foo, which can be sensibly used in a union or a stream.
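
A minimal test program makes this easy to verify (the exact numbers will vary slightly with compiler and platform, as noted above):

#include <stdio.h>
#include <stddef.h>

/* Bar with its members reordered largest-first: same members, no internal padding */
struct BarReordered
{
	int a;
	int b;
	char d;
	char e;
};

int main(void)
{
	printf("sizeof(struct BarReordered) = %u\n", (unsigned)sizeof(struct BarReordered));
	printf("offsetof(d) = %u, offsetof(e) = %u\n",
	       (unsigned)offsetof(struct BarReordered, d),
	       (unsigned)offsetof(struct BarReordered, e));
	return 0;
}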

Writing Efficient C/C++ Code

The question of how to make a C/C++ program faster comes up all the time. Responses to such questions are frequently misguided, too smart by half, or just plain wrong. That is why I decided to write an article about reasonably portable tried and true optimization techniques that are basic enough to be generic yet useful enough to yield significant performance gains.

Precision

As a general rule, in order of preference:

  1. Try to stick to word sized variables. If your processor is 32-bit, use 32-bit ints and 32-bit floats.
  2. Don’t use more floating point precision than you need. Don’t use a double if a float will do.
  3. Only use non-word-sized variables when you stand to lose nothing else by doing so. For example, if you only need a short and don’t stand to lose vectorization in a loop because of it, then fair enough – in such extreme cases, you might as well save a bit of memory. Note, however, that the memory saving may not be realized in all cases due to data structure alignment issues. See this article for more details.

This is particularly important in code that the compiler can vectorize.

Additionally:

  1. Use unsigned types when you don’t need them to be signed. Operations with them can sometimes be more efficient.
  2. Avoid mixing types in operations. Operations on mixed types are slower because they require implicit casting. Additionally, mixed operations don’t vectorize.

Integers

Since the Pentium MMX, vector integer operations have been supported in the x86 line of processors. These typically only support 32-bit integers; in the case of the original MMX instructions, 2 at a time. This means that using integer types that are not 32-bit in size will cause the code to not vectorize – casting a char to an int is typically expensive enough to undo the benefit of vectorization.

The other aspect to consider when dealing with integers is pointer size. When iterating over an array in a loop, it is inefficient to use non-word-sized iterators. Pointers are word sized. Adding an offset that is not word sized requires the iterator to be promoted before the new pointer can be computed. This adds significant overhead in loops and prevents vectorization.
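
For illustration (these functions are hypothetical, not from any particular code base), the second version below indexes the array with a word-sized iterator, while the first forces a widening of the iterator before every access:

float sum_narrow(const float *p, unsigned short n)
{
	float s = 0.0f;
	unsigned short i;

	for (i = 0; i < n; i++)	/* i has to be widened before it can be added to p */
		s += p[i];
	return s;
}

float sum_wide(const float *p, unsigned int n)
{
	float s = 0.0f;
	unsigned int i;

	for (i = 0; i < n; i++)	/* word-sized iterator, no promotion needed */
		s += p[i];
	return s;
}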

Floating Point

Even if your SIMD vector unit can handle doubles, it can typically handle twice as many floats as doubles for the same operation in the same amount of time.

There is a common misconception that most modern FPUs take no penalty from working with doubles instead of floats, because the internal computation is performed using double precision arithmetic anyway. That the FPU handles everything as doubles internally may or may not be true – but it is also irrelevant. There are at least 3 more important factors to consider:

  • Register size

Most modern FPUs are designed to handle SIMD/vector operations (often referred to as SSE on x86 or AltiVec on PowerPC). The SIMD registers are typically 128-bits wide, and allow either 4 floats or 2 doubles to be packed into them. Regardless of whether floats are handled with double precision internally, a single vector operation will work on twice as many floats as it does doubles. In other words, it will still go at least twice as fast when performed on multiple data.

  • Data size

When the data no longer fits on the CPU cache(s), a round trip to RAM is required to fetch it. This is typically an order of magnitude slower than fetching the data from cache. Floats are half the size of doubles. Thus, twice as many of them can be kept in the caches before they spill out. Thus, all other things being equal, if a double array doesn’t fit in the cache, it can be up to an order of magnitude slower to work with than a float array that does fit in the cache.

  • CPU capabilities

This is not such a huge issue any more, but Pentium III cannot vectorize doubles. SSE1 only supports floats. Pentium 4 and later x86 processors support SSE2 which can handle doubles, but this doesn’t negate point 1. above. Other processors’ vectorization capabilities (ARM Neon, PowerPC Altivec) may require additional considerations, so know your platform.

Compilers

Use a good compiler for your target platform. Sadly, GCC isn’t great. It’s not even good. When it comes to tight vectorized loops and you don’t want to reach for the hand crafted assembler, I have seen performance boosts of between 4.5x (on P3 / Core2) and 8.5x (P4) when using Intel’s ICC compiler instead of GCC. IBM’s XL/C compiler for the PowerPC yields similar improvements on that platform, and Sun have their own optimizing compiler for the SPARC. They are almost certainly worth looking into.

Learn your compiler’s features and switches. Enable reporting of warnings and remarks. Listen to what the compiler is telling you, and work with it. It is on your side. If it says that it couldn’t vectorize a loop, establish why, and if possible, re-write the loop so that it does vectorize. Performance benefits from this can be huge.

Vectorizing

Vectorization is essential for getting the most out of modern processors. Unfortunately, even the best compilers often need a bit of help from the developer, with assorted little tweaks to help them vectorize the code. These are generally not ugly, and are most of the time limited to one of the following:

  1. Copy the object property into a local variable. This will help persuade the compiler that there is no vector dependence it needs to worry about.
  2. If you have a loop where you are operating with the iterator on an array, have a shadow iterator of a vectorizable type. Remember you cannot use mixed types in vector operations. For example, you can do something like this:
static unsigned int i;
static float ii;
static float baz[16];

for (i = 0, ii = 0; i < 16; i++, ii++)
	baz[i] *= ii;  // vectorizes
	//baz[i] *= i; // doesn't vectorize

You may also feel clever here and decide to try to cut a corner with the old counting down hack, by rewriting the loop:

for (i = 16, ii = 16; i--;)
	baz[i] *= --ii;

In theory, this should yield code that is a few bytes smaller, and a little faster. Unfortunately, the benefits of this are questionable at best, and counter-productive more often than not. CPU caches are typically optimized for forward rather than backward reads, and performance has been observed to actually drop when using this trick on some otherwise superb compilers. At the very least, benchmark your code/compiler combination to see if there is a worthwhile benefit to be had.

General Optimizations

  • Avoid Division

Consider your divisions carefully. Divisions are expensive. If you are dividing by the same value multiple times, compute 1 / value once and multiply by that instead.
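
A typical case (hypothetical code) is scaling an array, where a single division replaces one division per element:

void scale_by(float *data, unsigned int n, float divisor)
{
	const float inv = 1.0f / divisor;	/* one division... */
	unsigned int i;

	for (i = 0; i < n; i++)
		data[i] *= inv;			/* ...instead of n of them */
}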

  • Keep Your Loops Tight

Move as much code as you can outside a loop. If there are parts of your calculation that simplify to a single expression, calculate that expression outside the loop.

  • Static Variables

Declare variables as static where possible. This especially goes for large arrays. Static variables are allocated once, in the data segment rather than on the stack, and stay allocated for the lifetime of the program. That means that if you keep calling a function, a dynamically allocated array will keep getting malloc()-ed and free()-ed, and this is expensive. A static array will not. If you know the maximum size you’ll need for an array, declare it static once, and re-use it.
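
For example (a sketch; the buffer size is arbitrary, and note that a static buffer is shared between calls, so it is not thread-safe):

#define SCRATCH_MAX 65536

void process_samples(const float *in, unsigned int n)
{
	static float scratch[SCRATCH_MAX];	/* allocated once, reused on every call */
	unsigned int i;

	for (i = 0; i < n && i < SCRATCH_MAX; i++)
		scratch[i] = in[i] * 0.5f;

	/* ... work on scratch ... */
}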

  • Bitwise Operations Are Cheap – Use Them

For example, where powers of 2 are involved:

Integer Modulo:

x % 8   becomes   x & 0x7

Bitwise AND to truncate everything but the modulo bits.

Integer Multiply:

x * 8   becomes   x << 3

Left shift by n to multiply by 2^n

Truncating floats: *((unsigned int*)&b) &= 0xFFFFFF00;

Truncates float b to 16-bit mantissa precision (the 8-bit exponent and the sign are untouched). Note that this kind of type punning through pointer casts breaks strict aliasing rules, so it should only be used with strict aliasing disabled (e.g. -fno-strict-aliasing) or rewritten via memcpy().

  • Partial Caching

If your function parameters are changing partially, evaluate them partially and cache the results for each part so you don’t have to re-calculate. For example, if your function is something like:

Y = a * (bX^2 + c) + d;

Arrange your loops so the outer-most one works on the inner-most evaluation (in this case X*X, followed by multiplication by b, followed by addition of c, followed by multiplication by a, followed by addition of d in the outermost loop). You can then cache the values of X*X (which is, incidentally, much faster than pow(X, 2)), b*X^2, b*X^2+c, etc., so when your outer parameters change, you don’t have to re-calculate the inner terms. How you design your caches is also important. This can cause more overhead than it saves, so you have to optimize it very carefully while keeping your underlying algorithm structure in mind.
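
Here is a stripped-down sketch of the idea for the Y = a * (bX^2 + c) + d example (the array sizes and names are made up); x2 and inner act as the caches, so the outer parameters never force the inner terms to be recomputed:

#define NX 1024	/* number of sample points */
#define NA 16	/* number of candidate values per parameter, purely illustrative */
#define NB 16
#define NC 16
#define ND 16

static float X[NX], Y[NX];
static float a[NA], b[NB], c[NC], d[ND];
static float x2[NX];	/* X^2       - computed once, X never changes     */
static float inner[NX];	/* b*X^2 + c - recomputed only when b or c change */

void fit_sweep(void)
{
	unsigned int ai, bi, ci, di, i;

	for (i = 0; i < NX; i++)
		x2[i] = X[i] * X[i];

	for (bi = 0; bi < NB; bi++)
		for (ci = 0; ci < NC; ci++) {
			for (i = 0; i < NX; i++)
				inner[i] = b[bi] * x2[i] + c[ci];

			/* only the cached inner term is needed for the outer parameters */
			for (ai = 0; ai < NA; ai++)
				for (di = 0; di < ND; di++)
					for (i = 0; i < NX; i++)
						Y[i] = a[ai] * inner[i] + d[di];
		}
}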

For a quick and dirty implementation you can use the STL map, along the lines of:

#include <map>

typedef std::map<float, float*> Cache1;
Cache1 MyCache;
MyCache[a] = MyFloatArrayPointer;

This will enable you to establish the cache hit rates, but the chances are that the performance will be too slow for real world use. Write your own, custom caching storage/lookup algorithm specifically suited to your application.

If you don’t need the extra precision, consider a float truncation technique, as described above. Floats can be slightly unstable with extreme compiler optimizations enabled. This may disrupt the cache keys and cause a miss when there should be a hit. Truncating a float will reduce the possibility of this happening.

  • Bespoke Code

Write the optimized code for your specific application yourself. Where performance isn’t an issue, 3rd party libraries are great for rapid development or proof of concept prototypes, but there is a price to pay in terms of performance when using generic code vs. bespoke code specifically optimized for a particular task.

  • Compilers and Numerical Stability

Learn the compiler switches for your compiler. Test the accuracy of your resulting numbers. When you start cutting corners (e.g. “-ffast-math -mfpmath=sse,387” on GCC or “-fp-model fast=2 -rcd” on ICC) you may get more speed, but sometimes the precision on your floating point variables will reduce. This may or may not be acceptable for your application.

  • Think About Your Algorithms

All the optimizing tweaks in the world won’t fix a bad algorithm. Optimize your design before your implementation. For example, if you are doing a calculation that is a modulo of a power of 2 on an int, use a bit-wise AND instead of a modulo operator. It is an order of magnitude faster. (e.g. instead of X %= 4, do X &= 0x3).

It is impossible to summarize all possible optimization techniques, and they will differ from project to project and won’t all be appropriate all the time. But hopefully – this will get you started on the right path.