High Performance Computing Environment

OPTIMIZATION IN GENERAL

Introduction

This is about the optimizing of CPU intensive computer programs. Although optimization of computer code is dependent on the actual CPU at hand, there are some general principles which, when obeyed, will result in code that performs well on most, if not all, modern processors.

In order to get some feeling for the important aspects of producing efficient programs, we first take a look at the common characteristics of modern processors:

pipelined floating point processor. In general, the floating point processor is able to produce one or even two and more floating point results per clock-cycle. Depending on the length of the pipeline, it takes a number of clock-cycles before the result of the first floating point instruction becomes available. Pipelines exist for multiplying, adding and subtracting. Division is never pipelined and consequently, division takes many CPU cycles.

two or more levels of cache memory. Main memory is relatively slow: the processor is able to execute many instructions in the time needed to fetch data and instructions from main memory. Therefore, a hierarchy of memory is used, from fast to slow: CPU-registers, level1 cache, level2 cache, main memory. Data flows from main memory via the caches to the CPU registers. Complicated hardware takes care that recently addressed data, together with adjacent data is available in the caches. Level 1 cache is typically 64 Kilobyte, Level2 ranges from 128K to several Megabytes, main memory is typically hundreds of Megabytes. Some processors even have a Level3 cache. In general, processors have separated caches for instructions and data.

How to make use efficiently of the modern hardware ?

First, one should evaluate the possibility to use vendor libraries. Because vendors try to advertise their products as being the fastest processor available, in general they take care that optimized numerical libraries tuned for their own processors are available. In general, vendors have available optimized LAPACK and BLAS subroutines, and Fast Fourier subroutines. When these libraries are not available, usage can be made from the ATLAS package. When one installs this freely available package, it determines on what kind of hardware it is running. Based upon the characteristics found, optimal code for the BLAS and some LAPACK subroutines is generated. Often it is not easily possible to transform one problem to standard linear algebra and Fourier transforms. In those cases it is necessary to produce code yourself. Roughly the following steps are to be taken:

Choice of language
The choice of language is not only a matter of efficiency, but also a matter of personal taste and experience. Also standardization considerations can play a role. In any case, one should opt for a compiled language, and not a interpreted one, even if it uses an intermediaries byte-code. This means, that in general Python, Java and Perl are not the languages of choice for CPU intensive parts of your program. In general, one should use Fortran, Fortran90 or C. C++ is an option, but special care must be taken that no time consuming constructors and destructors are used in the CPU intensive parts. In general, Fortran77 programs are the most efficient, partly because of the long history and experience in Fortran compiler, part because of the language itself. According to the Fortran standard, a compiler can make assumptions about the layout of data that are not permitted in the other languages.

Finding the hot spots
When optimizing an existing program, the first step should be the identification of the 'hot spots' of the program. Often, a program spends most of its time in a few lines of code. To find these lines, special profiling tools are often available, depending on the operating system. As a start, it is best first to compile the program first with no optimization. Optimization by the compiler can rearrange code, so that the line numbers the profiler shows are sometimes misleading. When one understands the behavior of the program with respect to the location of the hot spots, the code optimized by the compiler can be profiled, to see if some hot spots are already dealt with by the compiler. When profiling a program, it is important to run the program with a typical example input such that the program spends its time in parts that are typical for a normal run. It is also important, that the code runs in a decent number of minutes (1 to 10), in order to reduce man hours needed to optimize.

Analyzing the hot spots
After locating the hot spots, one should attempt to find out what kind of work is done in the hot spots. It could be the case that another algorithm is in order, for example replacing a bubble sort algorithm (popular, but very inefficient) by a quick sort or heap sort algorithm. In general, one will not be so lucky, and the code must be analyzed in terms of efficient register and cache usage and so on.

Coding for efficient cache usage

Register and cache usage is very important for the efficiency of a program. To use registers and cache efficiently some global knowledge about the way compilers generate code is necessary. When the compiler detects a loop, then in general the loop variables are placed in registers, which allows for very fast access. Also, terms that do not change value during the loop, are calculated beforehand and the result is placed in a register for efficient use. In general, compilers will perform the optimizations one could do by hand at a first glance at the loop. So, it is important that loops are recognizable for a compiler, and also the structure of data access should be apparent. For example: a Fortran construction like

do i=1,m

  do j=1,n

    c(i,j)=0

    do k=1,p

      c(i,j) = c(i,j)+a(i,k)*n(k,j)

    enddo

  enddo

enddo

Note: this is not an optimal piece of code! and will be optimized by a modern compiler. It is even possible that the compiler recognizes this as a matrix multiplication and generates very efficient code that is specially developed for this purpose. In general the compiler will recognize the fact that the address of c(i,j) is constant in the k-loop and that c(i,j) can be placed in a register. Only after the end of the k-loop c(i,j) is to be written to memory. The loop variables i,j and k are kept in registers. Only after the end of the i-loop the end-values + 1 could be written to memory (however: following the Fortran standard, a loop variable is undefined after a completion of a loop, so even this step is not necessary). One could have coded the example as follows:

        i=0

  10    i=i+1

        if (i .gt. m) goto 100

        j=0

  20    j=j+1

        if (j .gt. n) goto 20

        k=0

        x=0.0

  30    k=k+1

        if (k .gt. p) goto 80

        x = x+a(i,k)*b(k,j)

        goto 30

  80    c(i,j)=x

        goto 20

  100   continue

Obviously, this code is much less clear for the human eye, but one can also expect that a compiler will not recognize this as a three-nested loop. So, depending on the sophistication of the compiler, it is possible that in this case no loop is recognized at all, and that the compiler will only perform some very basic optimizations. In general one should try to:

write short loops, using the constructs of the language (Fortran: do-loop, C: for-loop)

do not exit from a inner loop (no goto or break)

don't call subroutines or functions in a inner loop (of course, when the time the function takes is relatively long, than there is no harm calling a function in a loop. But then, the loop would not be a hot spot, but the function)

make loop-independent (sub)expressions and address calculations easily recognizable, for example, use

for (i=0; i!=n; i++)

a[i] = a[i]+x+y;  /* x and y loop-independent */

and not:

for (i=0; i!=n; i++)

a[i] = x+a[i]+y;  /* x and y loop-independent */

Even if the compiler notices in the second case, that x+y is constant in the loop, it may decide not to take advantage of that fact, because adding numbers in a different sequence can yield different results when dealing with floating point representations with a limited accuracy.

Don't put too much code in a single subroutine. The compiler has a limited amount of space for analyzing code, and a large subroutine may be too large for it's internal tables. If possible, put the hot spots in separate subroutines with not too much surrounding code. This is also more convenient when trying to hand-optimize the code.

General coding principles to get optimized code

For the modern processors there is one important issue: if a data element is accessed, take care that the same element is accessed in the near future (next iteration or next statement) or that a nearby element is accessed in the near future. This strategy will take care of optimizing the use of the processor cache or even the register usage. A simple example:
The loops

for (i=0; i!=n; i++)

  a[i] = i;


for (i=0; i!=n; i++)

  b[i]=sin(a[i]);

should be combined:

for (i=0; i!=n; i++) {

  a[i]=i;

  b[i]=sin(a[i]) /* reuse of a[i] */
	
}

Often, compilers can do this automatically, but maybe the first loops were separated by many lines of code. In that case the likelihood that the compiler will recognize the optimizing opportunities will be less. Another example:

   dimension a(n),b(n),c(n),d(n),e(n),f(n),x(n)

   do i=1,n

     x(i) = a(i)+b(i)+c(i)+d(i)+e(i)+f(i)

   enddo

This doesn't look like a candidate for optimizing. But one cannot be sure that the cache can accommodate the neighborhoods of all array elements a(i) .. f(i). Even if the cache can hold enough data, there is also the problem of the mapping from real memory to cache memory. For example, it can very well be, that element a(1) gets the same location in the cache as element b(1), although they have different locations in real memory. To remove this problem one could recode the program as follows:

dimension a(6,n),x(n)

do i=i,n

  x(i)=0

  do j=1,6

    x(i) = x(i)+a(j,i)

  enddo

enddo

Now the elements of the matrix a are accessed in the order they are lay-out in memory: a(1,i) is followed by a(2,i) and so on.
Note that in C the order is different. The C equivalent would be:

double a[n][6], x[n];

for (i=0; i!=n; i++) {

  x[i]=0;


  for (j=0; j!=6; j++)

    x[i] += a[i][j];

}

Optimal order of loops

This brings us on another topic: when accessing elements of a multidimensional array, for Fortran the inner loop should involve the first index. For other languages, the inner loop should involve the last index.

Indirect addressing
From the above one can deduce that using indirect addressing as in

for (i=0; i!=n; i++)

  a[ b[i] ] = ....

   

for (i=0; i!=n; i++)

  x += a[ b[i] ];

is inefficient. This not always avoidable and again, it does no harm when such a loop is not a hot spot in your program. When it is, one could consider the following possibility, usable when the array b does not change every time the loop is entered:

... calculation of b ...

/* gather elements of a, indexed by b into an array */

for (i=0; i!=n; i++)

  aa[i]=a[ b[i] ];

And then many times:

for (i=0; i!=n; i++)

  aa[i] = ...

     
for (i=0; i!=n; i++)

  x += aa[i];

/* Finally, if needed: restore aa in a */

for (i=0; i!=n; i++)

  a[ b[i] ] = aa[i];

Loop unrolling
Loop unrolling is a technique to decrease the number of iterations in a loop, thus decreasing loop overhead. For example:

do i=1,100

  a(i)=b(i)

enddo

could be rewritten as follows:

do i=1,100,4

  a(i)=b(i)

  a(i+1)=b(i+1)

  a(i+2)=b(i+2)

  a(i+3)=b(i+3)

enddo

In the first loop, 100 iterations are needed, in the second only 25, so in the second loop fewer tests and jumps are necessary.
Although this technique was used in former days, we don't recommend it because this kind of optimization will be done by the compilers. Knowledge about the optimum degree of unrolling is built in in the compilers and automatically code is generated to deal with cases where the degree of unrolling doesn't divide evenly the total number of iterations. Moreover, hand-unrolled loops are less readable.

Expensive expressions
Multiplications, additions and subtractions are very fast operations but division is extremely expensive in terms of CPU time. On most processors, a division takes as much time as taking the square root. Although one should try to avoid divisions, sometimes a division is necessary. Here are a few examples to eliminate divisions:

read *,x

do i=1,n

  a(i)=b(i)/x

enddo

Can be replaced with:

read *,x  

x1 = 1.0/x

do i=1,n

  a(i)=b(i)*x1

enddo

Compilers are in general able to generate code like this. There is a little problem with this transformation: it is possible that c/x differs in the last bit from c*(1.0/x), consider for example the case that c=14.0 and x=7. Probably, c/x will yield exactly 2.0, whereas c*(1.0/x) maybe yield exactly 2.0, or maybe not, because 1.0/7.0 is not exactly representable in a floating point element. It depends on the design of the compiler if this optimization is done. So, when division takes place in a hot spot, one should look for opportunities to replace the division by a multiplication. In general, clearness of code will suffer, but in this case it is worthwhile.

Replacing calls to intrinsics (sin, cos etc) by a table lookup
In general, the intrinsic functions are optimized to a very high degree. If one considers to optimize the calculation of this kind of functions, one should realize that a table lookup without interpolation costs already at least a conversion from float to integer. When the operand has to be normalized, even more expensive operations are needed. So, table lookup has a very limited use. Note, that table lookup can also disturb the data caches. In general, it does not pay to circumvent calls to sin, cos etc. Of course, it is always a good practice not to call the function twice with the same argument:

x=sin(y);

p=sin(y)*sqrt(y);

should be replaced by:

x=sin(y);

p=x*sqrt(y);

(In this simple case, the compiler will probably do this automatically).

Minimizing call overhead
Sometimes, a subroutine is called very often in a inner loop. When the work done in the subroutine is small, the calling overhead can overwhelm the work and there can be a reason to try to eliminate this overhead.
For example, this C function surely is too short when called in a inner loop:

double add(double x, double y)

 {

   return x+y;

 }

The remedy is: replace the call to the subroutine with in-line code. In this case:

for (i=0; i!=n; i++)

  z = add(z,x(i));

should be replaced with:

for (i=0; i!=n; i++)

  z += x(i);

In cases that are not as evident as in this example, for example when the subroutine contains more code, or when the subroutine is used on many places, then the compiler can often help. Most compilers have an inline option: when used the compiler will replace the call to the subroutine with code from the subroutine itself. Most of the times this will result in a faster program, but again: only when the call to the subroutine is a hot spot.

Avoid conversions
Conversions from integer to float and vice versa should be avoided because they are expensive. You don't have to bother with very simple constructs like

x = 0;  /* x being a float */

because the compiler will recognize this, and generates code as if one had specified:

x = 0.0;

In constructs like:

double x[n];

int m[n];

     

for (i=0; i!=n; i++)

  x[i] = x[i] * m[i];

in every iteration an conversion from integer to float has to be done: first m[i] is converted to float, then the multiplication is done. In this case one could consider the possibility to change the type of m to double.

Usage of multidimensional arrays
Example:

dimension a(n,n)

   do j=1,n

    do i=1,n

      a(i,j) = ...

    enddo

  enddo

This example uses an two dimensional array, the first index runs fastest, so it is quite optimal. To calculate the address of a(i,j), an expression like n*(j-1)+i must be evaluated, so one could be tempted to use a one dimensional array instead:

dimension b(n*n)

      
k=0

do j=1,n

  do i=1,n

    b(k+i) = ...

  enddo

  k=k+n

enddo

This looks very efficient, and in fact, it is, but modern compilers will transform the two-dimensional example to something very much like the one dimensional example and can produce even more efficient code involving keeping addresses in registers. So, there really is no need to avoid arrays with more than one dimension. (In old programs the one dimensional variety was often used, because the compiler technology was not as sophisticated as it is now). In ANSI C, there is a problem: You can define multidimensional arrays and use them, but, unlike in Fortran, it is not possible to pass them to a subroutine, along with the dimension, for example:
int func(int m, int n, double a[m][n])
is forbidden in ANSI C (gcc, however allows this construct) Of course, one can use pointers in C:

void func(int m, int n, double **a); /* prototype */

int main()

{

   ........

   double **a;

   int m=20, n=10;

   int i,j;

   a  = (double **) malloc(m*sizeof(double *));

   for (i=0; i!=m; i++)

     a[i] = (double *) malloc(n*sizeof(double));

       

for (i=0; i!=m; i++)

     for (j=0; j!=n; j++)

       a[i][j] = 100*i+j;

          
   ........

   func(m,n,a);

   ........

}

This works satisfactorily, but not optimal. The address of a[i][j] cannot be calculated from i, j and the dimensions of the matrix, but must be deduced from the value of a pointer to a row of the matrix. This involves an extra memory reference and takes extra cache space. This is probably the reason, that in C it is quite common to use one dimensional arrays, and calculate the offset from the start of the array:

void func(int m, int n, double* a); /* prototype */

int main()

{

   ........

   double *a;

   int m=20, n=10;

   int i,j;

   a=(double *) malloc(m*n*sizeof(double));

   k=0;

   for (i=0; i!=m; i++) {

     for (j=0; j!=n; j++)

       a[k+j] = 100*i+j;

     k += n;

   }
     

   ........

   func(m,n,a);

      
   ........

}