Choice of Compilers – Part 2: SPARC

In the previous article in this series we looked at the performance of different compilers on Intel’s x86/x86-64 architecture. In this article we will look at the performance of the two available compilers on Sun’s SPARC Niagara platform.

Only two compilers were used in this part of the test, and versions of both of these were included in the previous roundup. They are:

  1. GNU GCC (4.2.3)
  2. Sun’s SunCC (12.0, part of Sun Studio development toolkit)

The server used for testing was a Sun T2000 with a 1.0GHz SPARC T1 processor with 8 cores and 32 threads. The operating system was Solaris 10.

The test application fits curves to arbitrary sampled data. It has 4 innermost loops for fitting 4 curve parameters, and it implements its own partial caching using a technique described in a previous article on this site. The code implements most of the optimizations mentioned in that article, in order to improve vectorization and the speed of the application.

More information on the application used for testing is available in the previous article that covers compiler performance on the x86 platform.

Compiler Switches

Selecting the correct compiler switches is important for achieving the best performance from your code. They can also break things and make the code produce results that are quite clearly wrong. The switches used in the tests were the ones found to produce the fastest code without producing results that are wildly inaccurate (some numeric instability is acceptable, as long as it is limited).

Compiler switches used are separated into 2 parts for each compiler: the common part and the machine specific part. The common part is the same for the given compiler on all platforms, and the machine specific part varies according to the processor. The switches used are heavily based on what the compiler developers deemed a good, fully optimized combination (based on the compiler’s -fast shortcut switch or equivalent references in the documentation). Some options, however, have been added/removed/changed due to producing poor results, in either accuracy or speed.

GCC

Common -O3 -fno-strict-aliasing -ffast-math -foptimize-register-move -frerun-loop-opt -fexpensive-optimizations -fprefetch-loop-arrays -fomit-frame-pointer -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=5 -fpic -Wall
SPARC T1 -mcpu=niagara -mtune=niagara

SunCC

Common -xO5 -xalias_level=simple -fns=yes -fsimple=2 -xbuiltin=%all -xdepend=yes -xlibmil -xlibmopt -xunroll=4 -xprefetch=auto -xprefetch_level=1 -xipo=2 -xldscope=global -Yl,/usr/bin -Qpath /usr/bin -features=extensions
SPARC T1 -xarch=v9 -xchip=ultraT1 -xtarget=native -m64 -xcache=native

Results

Here are the normalized scores, relative to GCC (GCC = 100%, more is better).

GCC    91.79 i/s  100.00%
SunCC  11.74 i/s   12.79% (with -xvector=lib)
SunCC  92.98 i/s  101.30% (without -xvector)

Analysis

Two things became evident during the test. First, we tried using the -xvector=lib parameter on SunCC in order to gain any available advantage from the vectorization features of the CPU. This, however, produced results that were an order of magnitude worse than GCC’s. It wasn’t immediately clear why SunCC’s performance was that much lower, but it was quickly pinned down to the -xvector compiler flag. Removing it put the performance of the produced code back on track, and it came out 1.3% ahead of GCC. Presumably, the massive slowdown caused by the -xvector=lib flag is down to a compiler or library bug.

The difference here is similar to the difference between GCC and SunCC on the x86 platform, only this time, on the SPARC T1, the result is in favour of SunCC.

Unlike on the Intel platform where using Intel’s compiler produced a massive performance boost, on Sun’s platform the advantage of using Sun’s compiler is fairly minimal.

The other thing the vigilant among the readers may have noticed is that the performance of Sun’s SPARC T1 processor with the best compiler available is approximately 67x (a staggering 6,700%) slower than the long obsolete Pentium III used in the x86 compiler roundup (93 i/s T1/SunCC vs. 6231 i/s P3/ICC). In addition, those familiar with the T1 processor will know that even though the T1 has 8 cores, it only has one FPU shared between them. This means that unlike on the multi-core Core2 CPUs, where performance in this test would scale nearly linearly with multiple parallel processes (only single thread tests were used in both this and the previous article), on the T1 the advantage of multiple cores/threads would be minimal. But then again, Sun openly admit that the T1 is not a processor to use for heavy floating point operations, which is largely what this test does.

While this poses some interesting questions about the performance of the T1, the purpose of this article was to analyze performance differences between compilers on the same platform, rather than relative performance differences between different platforms. A separate article focusing on performance differences between the x86 Core2 and SPARC T1 platforms is already in the making to cover this aspect.

Choice of Compilers – Part 1: x86

In a previous article about writing efficient code, I briefly touched upon the subject of compilers. In this article we will look at the major C++ compilers available for Linux and compare their performance, cost, and highlight any pitfalls encountered.

Platforms

For this review, the following platforms will be used.

Servers

Athlon XP 1.53 GHz RedHat 9
Pentium III 1.49 GHz RedHat 9
Pentium 4 Celeron 1.70 GHz RedHat 9
Core2 x86-64 1.89 GHz CentOS 5

Compilers

GNU GCC 4.2.1
PGI PGCC 7.0.7
Intel ICC 9.1.051
Pathscale PathCC 3.0, CentOS 5 only
Sun’s SunCC 12.0, CentOS 5 only, glibc too old on RedHat 9

Note:

  1. A full x86-64 software stack is used on the Core2 system.
  2. Due to library version requirements, SunCC is only tested on the Core 2 system.
  3. Only one trial licence was available for PathCC, so it was only tested on the Core 2 platform. Since this is an x86-64 platform, the static binaries it produced didn’t work properly on IA32 platforms because of library dependencies.
  4. All of the above compilers were the current, state of the art versions at the time of writing this article.

Software

Benchmarks are largely meaningless. The only worthwhile benchmark is your specific application. If you are reading articles on this site, the chances are that you are a developer and already know this, but I still feel I have to stress it. The results here are for my application. It was chosen because it is the heavy number crunching application I have worked on most recently, and not for reasons of testing any specific set of features of any of the hardware or software mentioned.

Now that the boring disclaimer is over, the application in question fits curves to arbitrary sampled data. It has 4 innermost loops for fitting 4 curve parameters, and it implements its own partial caching using a technique described in a previous article on this site. The code implements most of the optimizations mentioned in that article, in order to improve vectorization and the speed of the application.

The loops for curve fitting act as a good test for vectorization capabilities of compilers, and the operations performed in those loops are a combination of trigonometric functions and multiplication / addition. The caching implementation also throws in a pointer chasing element. All the iterators are unsigned integers and all the curve parameters and sampled data are single precision floats.

The nature of the algorithm for curve fitting in the test program is iterative. Different compilers with different optimization parameters suffer from different rounding errors in floating point operations. These rounding errors cause minor differences in convergence, and affect the number of iterations required to converge the solution. This may give one compiler an unfair (and coincidental) advantage over another, so instead of total time taken to converge, the results will be given in terms of iterations/second. Each iteration does the same amount of work, so this is a more reliable performance metric.

Compiler Switches

Selecting the correct compiler switches is important for achieving the best performance from your code. They can also break things and make the code produce results that are quite clearly wrong. The switches used in the tests were the ones found to produce the fastest code without producing results that are wildly inaccurate (some numeric instability is acceptable, as long as it is limited).

Compiler switches used are separated into 2 parts for each compiler: the common part and the machine specific part. The common part is the same for the given compiler on all platforms, and the machine specific part varies according to the processor. The switches used are heavily based on what the compiler developers deemed a good, fully optimized combination (based on the compiler’s -fast shortcut switch or equivalent references in the documentation). Some options, however, have been added/removed/changed due to producing poor results, in either accuracy or speed.

GCC

Note:
On Core 2, the nocona target architecture is used. This is due to there being no Core 2 specific optimization in this version of the compiler, with Nocona being the latest supported Intel CPU.

Common -O3 -fno-strict-aliasing -ffast-math -foptimize-register-move -frerun-loop-opt -fexpensive-optimizations -fprefetch-loop-arrays -fomit-frame-pointer -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=5 -fpic -Wall
Athlon XP -march=athlon-xp -mtune=athlon-xp -mmmx -msse  -m3dnow -mfpmath=sse -m32 -malign-double
Pentium III -march=pentium3  -mtune=pentium3  -mmmx -msse          -mfpmath=sse -m32 -malign-double
Pentium 4 -march=pentium4  -mtune=pentium4  -mmmx -msse2         -mfpmath=sse -m32 -malign-double
Core 2 -march=nocona    -mtune=nocona    -mmmx -msse3         -mfpmath=sse -m64

PGCC

Common -O4 -Mfprelaxed -Msingle -Mfcon -Msafeptr -Mcache_align -Mflushz -Munroll=c:1 -Mnoframe -Mlre -Mipa=align,arg,const,f90ptr,shape,libc,globals,localarg,ptr,pure -Minfo=all -Mneginfo=all -fpic
Athlon XP -tp=athlonxp -Mvect=sse
Pentium III -tp=piii -Mvect=sse
Pentium 4 -tp=piv -Mvect=sse -Mscalarsse
Core 2 -tp=core2-64 -Mvect=sse -Mscalarsse

ICC

Common -O3 -ansi-alias -fp-model fast=2 -rcd -align -Zp16 -ipo -fomit-frame-pointer -funroll-loops -fpic -w1 -vec-report3
Athlon XP -msse  -xK -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -cxxlib-icc
Pentium III -msse  -xK -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -cxxlib-icc
Pentium 4 -msse2 -xW -march=pentium4 -mcpu=pentium4 -mtune=pentium4 -cxxlib-icc
Core 2 -msse3 -xP

Note:

  1. Athlon XP is using code targeted for the Pentium III because ICC doesn’t specifically support AMD processors. However, the Athlon XP and Pentium III share the same instruction set, so it’s close enough.
  2. ICC 9.1.051 doesn’t support Core 2 specific optimizations. The latest it seems to support is Core 1. ICC 10.x supports Core 2, but at the time of writing of this article it was not yet deemed to be of production quality, so the older, stable version was used.

PathCC

Common -O3 -fno-strict-aliasing -finline-functions -ffast-math -ffast-stdlib -funsafe-math-optimizations -fstrength-reduce -align128 -fomit-frame-pointer -funroll-loops -fpic -gnu4 -Wall -LNO:vintr_verbose=ON
Core2 -march=core -mcpu=core -mtune=core -mmmx -msse3 -m64 -mcmodel=small

SunCC

Common -xO5 -xalias_level=simple -fns=yes -fsimple=2 -nofstore -xbuiltin=%all -xdepend=yes -xlibmil -xlibmopt -xunroll=4 -xprefetch=auto -xprefetch_level=1 -xregs=frameptr -xipo=2 -xldscope=global -Yl,/usr/bin -Qpath /usr/bin -features=extensions
Core 2 -xarch=sse3 -xchip=opteron -xtarget=opteron -xvector=simd -m64 -xcache=native -xmodel=small

Note:

  1. -Yl,/usr/bin and -Qpath /usr/bin are work-arounds for problems with the SunCC linker that make it fail to link some dynamic libraries. With these parameters the system ld is used instead, which may adversely affect performance – but it was the only option available to get the test completed.
  2. RedHat 9 did not have sufficiently up to date libraries to use the compiler, and attempts to produce static binaries for testing failed. This is why this compiler was only tested on CentOS 5.
  3. Opteron target architecture was used because no Intel x86-64 targets were documented. Opteron target worked fine, though.

Results

Let’s churn out some numbers, shall we? The percentage is the normalized score, relative to GCC (GCC = 100%, more is better).

Athlon XP   GCC     1973 i/s  100%
Athlon XP   PGCC    1653 i/s   84%
Athlon XP   ICC     5192 i/s  263%
Pentium III GCC     1812 i/s  100%
Pentium III PGCC    1590 i/s   88%
Pentium III ICC     6231 i/s  344%
Pentium 4   GCC     1169 i/s  100%
Pentium 4   PGCC    1130 i/s   97%
Pentium 4   ICC     8437 i/s  722%
Core 2      GCC     3270 i/s  100%
Core 2      PGCC    2715 i/s   83%
Core 2      ICC    16814 i/s  514%
Core 2      PathCC  3600 i/s  110%
Core 2      SunCC   3212 i/s   98%

Analysis

A quick note about PGCC’s performance – the test code uses some C math library functions (mainly sinf()). These are not implemented in PGCC in a vectorizable form, and I am told that this is why its performance suffers. There are reassurances that this is being actively worked on and that a fix will be available shortly, but it was not going to be available before my trial licence expired. Having said that, GCC doesn’t currently have sinf() in vectorizable form either, and it still came out ahead. Another thing worth pointing out is that PGI seem to focus more on building distributed applications on clusters. This review covers no such application, so PGCC may still be a good choice for your application – it just isn’t a good choice for mine.

So, what do the results seem to say? Looking at the directly comparable results, PGCC’s performance is generally about 12-17% behind GCC, although it almost catches up on the Pentium 4, where it is only 3% behind. Both GCC and PGCC seem to suffer when moving to the faster Pentium 4: their performance drops by around 30% despite an increase in clock-speed of 13%. This is a rather unimpressive result. ICC’s performance leaves GCC’s and PGCC’s in the realm of embarrassing. Not only does it outperform GCC and PGCC by 2.63x-7.22x and 3.14x-7.47x respectively, but its performance increases on the Pentium 4 instead of decreasing as it does with GCC and PGCC. And not only does it increase, it increases by 20% even relative to the clock speed, despite the Pentium 4 Celeron having a much smaller Level 2 cache than the Pentium III in this test (128KB vs. 512KB). This is a truly impressive result. Not only does it show that ICC is a very good compiler, but it also shows that a lot of the perception of the Pentium 4’s poor performance comes from poorly performing compilers, rather than from badly designed hardware.

On the Core 2, PGCC appears to close the gap, from being 7.5x slower than ICC on the Pentium 4 down to a marginally less embarrassing 6.2x slower. GCC remains safely ahead of PGCC in performance, but its lag behind ICC has increased (5.14x slower) compared to the Athlon XP and Pentium III results (2.63x and 3.44x respectively).

Sun’s compiler seems to be somewhat lacking in performance; its results are roughly on par with GCC’s. Pathscale’s compiler, however, manages to beat GCC into 2nd place, behind ICC. This is quite a respectable achievement considering that parts of the PathCC compiler suite are based on GCC.

Looking at the logged compiler remarks, it is clear that ICC is gaining a massive advantage from knowing how to vectorize loops. Since the code in question uses only single precision floating point numbers, full vectorization performance is achieved on all of the tested processors. Just about all inner loops in the code vectorize. Even though PGI explain that PGCC currently lacks vectorizable versions of more complex math functions (e.g. sinf()), the compiler also failed to vectorize much simpler operations in loops, like addition and multiplication. Instead it chose to unroll the loops. The response from PGI support is that unrolling loops with small iteration counts is deemed faster than vectorizing them. Clearly, this seems to be wrong, since both GCC and ICC vectorized these simple loops, and outperformed PGCC.

Diagnostic output about loop vectorization did not seem to be available from SunCC, as the only similar switches seemed to be related to OpenMP parallelization which was not used in this test.

For those that don’t know what vectorization is all about – you may have heard of SIMD: Single Instruction, Multiple Data. The idea is that instead of processing scalars (single values) you can process whole vectors (i.e. arrays) of scalars simultaneously, provided the operation you are performing on them is the same. So, instead of processing 4 32-bit floats one at a time, you pack them into a vector (if you have them in an array, or they are appropriately aligned in memory) and process the vector in parallel. x86 processors have had this capability since the Pentium MMX for integer operations and since the Pentium III for floating point operations. Provided that the compiler knows how to convert loop operations on arrays into vector operations, the speed-up can be massive.
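
As an illustration (a minimal sketch, not code from the test application), a loop like this is a prime candidate, because the same multiply-add is applied to every element of the float arrays:

float a[1024], b[1024], c[1024];

for (unsigned int i = 0; i < 1024; i++)
	c[i] += a[i] * b[i];

With SSE, a vectorizing compiler can process 4 such floats per instruction instead of 1, which is where most of ICC’s advantage in this test comes from.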

Costs

Not all of the compilers reviewed here are free. Some are completely free (GCC, SunCC), some have free components due to being based on GPL-ed code (PathCC), some are free for non-commercial/non-academic use (ICC), and some are fully commercial (PGCC).

For the non-free compilers, the costs listed on the vendor’s web sites are:

PGCC $419
ICC $449 (free for non-commercial/non-academic use)
PathCC $795

Other Issues

A few other issues have been encountered with the compilers during testing. They were resolved, but I still felt they should be mentioned.

PGCC’s licensing engine proved to be problematic and got in the way of using the compiler even when the instructions were followed correctly. The problem turned out to be caused by a broken component on the PGI servers for generating the trial licence keys. This led to a day or so of frustration, but was resolved in the end. While I understand that companies have to look after their intellectual property, Intel’s licensing engine worked much more simply and without any problems. Whereas PGI require a key to be generated on their web site and downloaded to a local file with the correct permissions, Intel simply provide a hex string key that can be pasted in at install time when setting up ICC. Intel’s method worked flawlessly. GCC and SunCC don’t have any licence control mechanisms, so this type of issue doesn’t apply to them.

SunCC required -features=extensions to correctly compile the test application, which took a bit of figuring out and a query on the support forum.

Intel, Sun and PGI all provide an online forum where any support issues can be raised and advice sought. In all cases support has been quite fast and at least informative, even if the issues couldn’t be resolved. The nature of the forums allows help to be provided both by the support staff and more experienced end users. For GCC there is extensive community support and mailing lists.

Conclusions

Choice of compiler can make a huge difference to the performance of your code. If speed is a consideration (and in some applications it may not be), then the chances are that you cannot afford to ignore the benefit that can be gained by switching to a different compiler. On the x86 platform, ICC is completely unchallenged for the title of producing the fastest code. It is also free for non-commercial use. Another thing worth mentioning is that ICC also produces the fastest code on AMD’s x86 processors.

On the most current x86-64 platform tested (Core 2), ICC produces code that runs in the region of 5x faster than the competing compilers. If it is a cost based choice between buying 5x faster (or 5x more) hardware and buying a better compiler, upgrading the compiler wins by an order of magnitude.

Alignment of Structs – Chaotic Evil

Don’t be tripped up by data alignment. Sometimes in C you want to play with unions and structs or just cast to a char * and stream data, but unless you consider the architecture and the alignment it is almost guaranteed to bite you (editor’s comment: or is that byte you?).

Most (read: all I have tested) compilers align data on natural boundaries based on type: on x86, doubles are typically 64-bit (sometimes 32-bit) aligned, ints 32-bit aligned, shorts 16-bit, chars 8-bit, etc.

So consider these 2 structs:

struct Foo
{
	int a;
	int b;
	char d;
	char e;
} A;

struct Bar
{
	char e;
	int b;
	char d;
	int a;
} B;

Do a sizeof(A) and sizeof(B) and you will find they yield very different results. (I got 12 and 16.)

Unfortunately, alignment, sizes and padding vary from compiler to compiler and architecture to architecture. Taking the example of Foo on x86-32 Windows with Borland Builder 5:

a is 4-byte word aligned
b follows straight after a, so it is 4-byte aligned by default
d follows straight after b, so it is 4-byte aligned by default, though it doesn’t need to be
e is byte aligned and directly follows d

2 bytes of padding are left at the end: the structure needs more than 2 words but can’t be 2.5 words, so it is rounded up to 3 4-byte words

In the example Bar:

e is 4-byte word aligned as it is at the start of the structure
b, even though e is only 1 byte, is 4-byte aligned for efficient memory access, so 3 spare bytes are left between them
d follows straight after b, so it is 4-byte aligned by default, though it doesn’t need to be
a, even though d is only 1 byte, is 4-byte aligned for efficient memory access, so 3 spare bytes are left between them

Bearing this in mind, we can write our code intelligently to ensure that we don’t waste memory and that our data is contiguous (without gaps between members): put the largest aligned types first, then progressively smaller types down to the smallest at the end.

By doing this we end up with a data structure without gaps (except possibly at the end), like Foo, which can be sensibly used in a union or stream.
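
If you want to see exactly what your own compiler does, a minimal sketch like this prints the sizes and member offsets, making the padding visible (the output described is what a typical 4-byte-aligned x86 compiler produces; yours may differ):

#include <cstdio>
#include <cstddef>

struct Foo { int a; int b; char d; char e; };
struct Bar { char e; int b; char d; int a; };

int main()
{
	// offsetof() reveals where the compiler inserted padding
	printf("Foo: size=%u a=%u b=%u d=%u e=%u\n",
		(unsigned)sizeof(Foo), (unsigned)offsetof(Foo, a), (unsigned)offsetof(Foo, b),
		(unsigned)offsetof(Foo, d), (unsigned)offsetof(Foo, e));
	printf("Bar: size=%u e=%u b=%u d=%u a=%u\n",
		(unsigned)sizeof(Bar), (unsigned)offsetof(Bar, e), (unsigned)offsetof(Bar, b),
		(unsigned)offsetof(Bar, d), (unsigned)offsetof(Bar, a));
	return 0;
}

On such a compiler, Foo comes out at 12 bytes with no internal gaps, while Bar comes out at 16 bytes, with 3 bytes of padding after e and another 3 after d.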

Writing Efficient C/C++ Code

The question of how to make a C/C++ program faster comes up all the time. Responses to such questions are frequently misguided, too smart by half, or just plain wrong. That is why I decided to write an article about reasonably portable tried and true optimization techniques that are basic enough to be generic yet useful enough to yield significant performance gains.

Precision

As a general rule, in order of preference:

  1. Try to stick to word sized variables. If your processor is 32-bit, use 32-bit ints and 32-bit floats.
  2. Don’t use unnecessary floating point precision if you don’t need it. Don’t use a double if a float will do.
  3. Only use non-word-sized variables when you stand to lose nothing else by doing so. For example, if you only need a short and don’t stand to lose vectorization in a loop because of it, then fair enough – in such extreme cases, you might as well save a bit of memory. Note, however, that the memory saving may not be realized in all cases due to data structure alignment issues. See this article for more details.

This is particularly important in code that the compiler can vectorize.

Additionally:

  1. Use unsigned types when you don’t need them to be signed. Operations with them can sometimes be more efficient.
  2. Avoid mixing types in operations. Operations on mixed types are slower because they require implicit casting. Additionally, mixed operations don’t vectorize, as the sketch below illustrates.
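
A minimal sketch of the mixed-type pitfall (array and variable names are hypothetical; exact costs depend on your compiler and CPU):

float in[256], out[256];

double dscale = 1.5;    // mixed types: every in[i] is promoted to double,
                        // computed in double, then converted back to float
for (unsigned int i = 0; i < 256; i++)
	out[i] = (float)(in[i] * dscale);

float fscale = 1.5f;    // single type: stays in float and can vectorize
for (unsigned int i = 0; i < 256; i++)
	out[i] = in[i] * fscale;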

Integers

Since the Pentium MMX, vector integer operations have been supported in the x86 line of processors. These typically only support 32-bit integers; in the case of the original MMX instructions, 2 at a time. This means that using integer types that are not 32 bits in size will cause the code to not vectorize – casting a char to an int is typically expensive enough to undo the benefit of vectorization.

The other aspect to consider when dealing with integers is pointer size. When iterating over an array in a loop, it is inefficient to use non-word-sized iterators. Pointers are word sized. Adding an offset that is not word sized requires the iterator to be promoted before the new pointer can be computed. This adds significant overhead in loops and prevents vectorization.
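
For example (a sketch; the exact penalty varies by compiler and target):

float data[1024];
unsigned int   i;  // word-sized iterator: indexes the array directly
unsigned short s;  // sub-word iterator: promoted on every access

for (i = 0; i < 1024; i++)
	data[i] *= 2.0f;   // clean index arithmetic, can vectorize

for (s = 0; s < 1024; s++)
	data[s] *= 2.0f;   // the promotion of s can get in the way of vectorization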

Floating Point

Even if your SIMD vector unit can handle doubles, it can typically handle twice as many floats as doubles for the same operation in the same amount of time.

There is a common misconception that most modern FPUs take no penalty from working with doubles instead of floats, because the internal computation is performed using double precision arithmetic anyway. That the FPU handles everything as doubles internally may or may not be true – but it is also irrelevant. There are at least 3 more important factors to consider:

  • Register size

Most modern FPUs are designed to handle SIMD/vector operations (often referred to as SSE on x86 or AltiVec on PowerPC). The SIMD registers are typically 128 bits wide, and allow either 4 floats or 2 doubles to be packed into them. Regardless of whether floats are handled with double precision internally, a single vector operation will work on twice as many floats as it does doubles. In other words, floats will still go at least twice as fast when operated on as multiple data.

  • Data size

When the data no longer fits in the CPU cache(s), a round trip to RAM is required to fetch it. This is typically an order of magnitude slower than fetching the data from cache. Floats are half the size of doubles, so twice as many of them can be kept in the caches before they spill out. Thus, all other things being equal, if a double array doesn’t fit in the cache, it can be up to an order of magnitude slower to work with than a float array that does fit in the cache.

  • CPU capabilities

This is not such a huge issue any more, but the Pentium III cannot vectorize doubles: SSE1 only supports floats. The Pentium 4 and later x86 processors support SSE2, which can handle doubles, but this doesn’t negate the register size point above. Other processors’ vectorization capabilities (ARM NEON, PowerPC AltiVec) may require additional considerations, so know your platform.

Compilers

Use a good compiler for your target platform. Sadly, GCC isn’t great. It’s not even good. When it comes to tight vectorized loops and you don’t want to reach for hand-crafted assembler, I have seen performance boosts of between 4.5x (on P3 / Core2) and 8.5x (P4) when using Intel’s ICC compiler instead of GCC. IBM’s XL/C compiler for the PowerPC yields similar improvements on that platform, and Sun have their own optimizing compiler for SPARC. They are almost certainly worth looking into.

Learn your compiler’s features and switches. Enable reporting of warnings and remarks. Listen to what the compiler is telling you, and work with it. It is on your side. If it says that it couldn’t vectorize a loop, establish why, and if possible, re-write the loop so that it does vectorize. Performance benefits from this can be huge.
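
For example, with GCC the vectorization report used throughout these tests is enabled like this (the file name is hypothetical):

g++ -O3 -ffast-math -ftree-vectorize -ftree-vectorizer-verbose=5 -c curvefit.cpp

At this verbosity the report states, loop by loop, whether vectorization succeeded and why it failed when it didn’t – exactly the feedback you need to restructure the loop.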

Vectorizing

Vectorization is essential for getting the most out of modern processors. Unfortunately, even the best compilers often need a bit of help from the developer, with assorted little tweaks to help them vectorize the code. These are generally not ugly, and are most of the time limited to one of the following:

  1. Copy the object property into a local variable. This will help persuade the compiler that there is no vector dependence it needs to worry about.
  2. If you have a loop where you are operating with the iterator on an array, have a shadow iterator of a vectorizable type. Remember you cannot use mixed types in vector operations. For example, you can do something like this:
static unsigned int	i;
static float			ii;
static float			baz[16];

for (i = 0, ii = 0; i < 16; i++, ii++)
	baz[i] *= ii;  // vectorizes
	//baz[i] *= i; // doesn't vectorize (mixes unsigned int and float)

You may also feel clever here and decide to try to cut a corner with the old counting down hack, by rewriting the loop:

for (i = 16, ii = 16; i--;)
	baz[i] *= --ii;

In theory, this should yield code that is a few bytes smaller, and a little faster. Unfortunately, the benefits of this are questionable at best, and counter-productive more often than not. CPU caches are typically optimized for forward rather than backward reads, and performance has been observed to actually drop when using this trick on some otherwise superb compilers. At the very least, benchmark your code/compiler combination to see if there is a worthwhile benefit to be had.

General Optimizations

  • Avoid Division

Consider your divisions carefully. Divisions are expensive. If you are dividing by the same value multiple times, get 1 / value and multiply by that instead.
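
A minimal sketch of the idea (in, out, n and scale assumed declared elsewhere):

// instead of dividing by the same value in every iteration...
for (unsigned int i = 0; i < n; i++)
	out[i] = in[i] / scale;

// ...compute the reciprocal once and multiply:
float inv = 1.0f / scale;
for (unsigned int i = 0; i < n; i++)
	out[i] = in[i] * inv;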

  • Keep Your Loops Tight

Move as much code as you can outside a loop. If there are parts of your calculation that simplify to a single expression, calculate it outside the loop.
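
For example (a sketch; a, b, c, n and the arrays assumed declared elsewhere):

// a * b + c doesn't depend on i, so hoist it out of the loop
float k = a * b + c;
for (unsigned int i = 0; i < n; i++)
	y[i] = x[i] * k;  // instead of y[i] = x[i] * (a * b + c)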

  • Static Variables

Declare variables as static where possible. This especially goes for large arrays. Static variables are allocated once, in the program’s data segment rather than on the stack, and they stay allocated. That means that if you keep calling a function, a dynamic array will keep getting malloc()-ed and free()-ed, and this is expensive. A static array will not. If you know the maximum size you’ll need for an array, declare it static once, and re-use it.
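
A sketch of the pattern (hypothetical function; note that a static buffer makes the function non-reentrant, so it doesn’t suit multi-threaded callers):

void process_block(const float *in, unsigned int n)
{
	static float scratch[4096];  // allocated once, reused on every call
	// ... work on scratch[0..n-1], with n assumed <= 4096 ...
}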

  • Bitwise Operations Are Cheap – Use Them

For example, where powers of 2 are involved:

Integer Modulo:

x % 8  →  x & 0x7

Bitwise AND keeps only the low-order (modulo) bits and truncates everything else. Note that this is only equivalent for unsigned (or known non-negative) values.

Integer Multiply:

x * 8  →  x << 3

Left shift by n to multiply by 2^n.

Truncating floats: *((unsigned int*)&b) &= 0xFFFFFF00;

Clears the low 8 mantissa bits of float b, leaving roughly 16 bits of effective precision (the sign and 8-bit exponent are untouched).
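
Note that the cast above breaks strict aliasing rules (harmless with the switches used in these tests, which disable strict aliasing, but it can misbehave under aggressive optimization). A memcpy-based sketch of the same truncation is safer; it assumes 32-bit floats and ints:

#include <cstring>

float truncate_mantissa(float b)
{
	unsigned int bits;                    // assumed 32-bit, same size as float
	std::memcpy(&bits, &b, sizeof bits);  // well-defined way to read the raw bits
	bits &= 0xFFFFFF00;                   // clear the low 8 mantissa bits
	std::memcpy(&b, &bits, sizeof b);
	return b;
}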

  • Partial Caching

If your function parameters are changing partially, evaluate them partially and cache the results for each part so you don’t have to re-calculate. For example, if your function is something like:

Y = a * (bX^2 + c) + d;

Arrange your loops so the outer-most one works on the inner-most evaluation (in this case X*X, followed by multiplication by b, followed by addition of c, followed by multiplication by a, followed by addition of d in the outermost loop). You can then cache the values of X*X (which is, incidentally, much faster than pow(X, 2)), b*X^2, b*X^2 + c, etc., so when your outer parameters change, you don’t have to re-calculate the inner terms. How you design your caches is also important. This can cause more overhead than it saves, so you have to optimize it very carefully while keeping your underlying algorithm structure in mind.
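
A minimal sketch of the loop arrangement for the example above (NX, NB and the arrays are hypothetical; only the X*X term is cached here, but the same applies to the other terms):

// Y = a * (b*X^2 + c) + d, swept over many values of b
static float xx[NX];  // cached X*X values

for (unsigned int i = 0; i < NX; i++)
	xx[i] = X[i] * X[i];  // computed once, not once per value of b

for (unsigned int j = 0; j < NB; j++)      // outer loop sweeps parameter b
	for (unsigned int i = 0; i < NX; i++)
		Y[j][i] = a * (B[j] * xx[i] + c) + d;  // xx[i] is never recomputed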

For a quick and dirty implementation you can use the STL map, along the lines of:

#include <map>

typedef std::map<float, float*> Cache1;
Cache1 MyCache;
MyCache[a] = MyFloatArrayPointer;

This will enable you to establish the cache hit rates, but the chances are that the performance will be too slow for real world use. Write your own, custom caching storage/lookup algorithm specifically suited to your application.

If you don’t need the extra precision, consider a float truncation technique, as described above. Floats can be slightly unstable with extreme compiler optimizations enabled. This may disrupt the cache keys and cause a miss when there should be a hit. Truncating a float will reduce the possibility of this happening.

  • Bespoke Code

Write the optimized code for your specific application yourself. Where performance isn’t an issue, 3rd party libraries are great for rapid development or proof of concept prototypes, but there is a price to pay in terms of performance when using generic code vs. bespoke code specifically optimized for a particular task.

  • Compilers and Numerical Stability

Learn the compiler switches for your compiler. Test the accuracy of your resulting numbers. When you start cutting corners (e.g. “-ffast-math -mfpmath=sse,387” on GCC or “-fp-model fast=2 -rcd” on ICC) you may get more speed, but sometimes the precision of your floating point results will suffer. This may or may not be acceptable for your application.

  • Think About Your Algorithms

All the optimizing tweaks in the world won’t fix a bad algorithm. Optimize your design before your implementation. For example, if you are doing a calculation that is a modulo of a power of 2 on an int, use a bit-wise AND instead of a modulo operator. It is an order of magnitude faster. (e.g. instead of X %= 4, do X &= 0x3).

It is impossible to summarize all possible optimization techniques; they differ from project to project and won’t all be appropriate all the time. But hopefully this will get you started on the right path.