Benchmarking

The purpose of benchmarking is to understand the behavior of a system. To this end, benchmarks must be repeatable and must make the best use of the system. This page describes steps you can take to ensure that your benchmarks are both well documented and make the best use of the system.

Documenting Benchmarks

As part of your benchmark, record the environment variables and PATH. The printenv command will output this data for you.
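If it is more convenient to capture this information from inside the benchmark itself, something like the following sketch could be called at startup. It assumes a POSIX environment where the global environ is available; the function name record_environment is hypothetical and is not part of any BG/L library.

```c
#include <stdio.h>

/* POSIX exposes the process environment through this global. */
extern char **environ;

/* Hypothetical helper: write every NAME=value pair (PATH included) to
   `out`, one per line, which is the same information printenv prints.
   Returns the number of entries written. */
int record_environment(FILE *out)
{
    int count = 0;
    for (char **entry = environ; entry != NULL && *entry != NULL; ++entry) {
        fprintf(out, "%s\n", *entry);
        ++count;
    }
    return count;
}
```

Calling record_environment on a log file opened next to the benchmark results keeps the environment and the timings together.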

The compilers also have many options and configuration settings. The compiler options -qlist and -qlistopt will create a list file that contains more information about the exact version of the compiler and the various settings. Knowing these values can help us work with IBM to understand any unexpected behavior.

Compiler Options

There are many compiler options that may affect performance. One goal of this phase of the benchmarking is to understand which of these options are valuable, and on what kinds of code.

The compilers are named blrts_xlc and blrts_xlf, and have many of the options listed below enabled by default. The option -qlistopt will list the options in use (in a file with extension .lst).

-qbgl
Marks this executable for BG/L only
-qarch=440d
Use both floating point units. To use only one of the two units, use -qarch=440
-qtune=440
Optimize the object code for the BG/L processor
-qcache=level=1:type=i:size=32:line=32:assoc=64:cost=8
-qcache=level=1:type=d:size=32:line=32:assoc=64:cost=8
-qcache=level=2:type=c:size=4096:line=128:assoc=8:cost=40
Specifies the cache sizes for the BG/L processor
-qnoautoconfig
Allows code to be cross-compiled on other machines at high optimization levels without losing BG/L-specific options

-qhot=simd
Enable the compiler options to "vectorize" loops (this means to use special instructions for handling more than one element at a time; there is no separate vector unit)
-O3
Set the optimization level to 3 (fairly aggressive). Levels 4 and 5 are also available; you should try to run at least at -O2.
IBM recommends always using the first five of the options above (-qbgl through -qnoautoconfig). If possible, run your benchmarks with the following choices (all with -qbgl -qnoautoconfig):
-qarch=440 -qtune=440
-qarch=440d -qtune=440
-qarch=440d -qtune=440 -qcache=...
-qarch=440 -qtune=440 -qcache=... -O3
-qarch=440 -qtune=440 -qcache=... -O3 -qhot
-qarch=440 -qtune=440 -qcache=... -O5 -qhot
-qarch=440d -qtune=440 -qcache=... -O3
-qarch=440d -qtune=440 -qcache=... -O3 -qhot
-qarch=440d -qtune=440 -qcache=... -O5 -qhot
These will help us understand the relative benefits of the different options. If you run only a few, consider running with optimization levels -O5 and -O3, each with the second floating point unit enabled (-qarch=440d) and disabled (-qarch=440).

Code Tuning

Data alignment and pointer (non)aliasing are important items to consider in tuning code for BG/L (and for many modern processors).

Data Alignment

Data that is aligned on more than 4- or 8-byte boundaries and that does not cross a cache line can be handled more efficiently in the Power architecture. There are a number of pseudo-functions that can be used to inform the compiler that data has particular alignment properties (the compiler knows the alignment of statically allocated data).

In C/C++, the pseudo function is __alignx. The prototype is

    void __alignx( int n, const void *addr )
where n is the alignment, in bytes, that applies to the pointer addr. For example, if x is aligned on a 16-byte boundary, you can use
    __alignx(16,x)
C users can use
#ifndef HAVE___ALIGNX
#define __alignx(a,b)
#endif
to keep code portable to other compilers.
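As a rough illustration of how the guard and the alignment assertion fit together, here is a minimal sketch. The function names daxpy_aligned and is_aligned16 are hypothetical, and the sketch assumes that HAVE___ALIGNX is defined (for example on the compile line) when building with a compiler that provides __alignx, so that the empty macro does not hide the real pseudo-function.

```c
#include <stdint.h>

/* Portability guard from the text: compilers without __alignx see a no-op. */
#ifndef HAVE___ALIGNX
#define __alignx(a,b)
#endif

/* Hypothetical debugging helper: returns 1 if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Computes y[i] += a * x[i]. The __alignx calls are a promise to the
   compiler, not a runtime check: the caller must guarantee that x and y
   really are 16-byte aligned. */
void daxpy_aligned(int n, double a, const double *x, double *y)
{
    __alignx(16, x);
    __alignx(16, y);
    for (int i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
```

Because a false alignment promise can lead to incorrect code, it may be worth checking pointers with a helper such as is_aligned16 in debug builds before calling the tuned routine.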

In Fortran, the pseudo function is ALIGNX(N,Y), where N is the same as for __alignx, and Y is a variable of any type.

Pointer Aliasing

This section applies mostly to C and C++ programmers. For the compiler to schedule load and store instructions efficiently and to unroll loops for performance, it often needs to know whether two pointers can point at the same data. If they can, the pointers are said to alias one another, and the compiler may be unable to perform some optimizations.

Well-written, modern C code will use the restrict qualifier to indicate that a pointer does not alias any other pointers in performance-critical code. This qualifier is used in the same way that register, const, or volatile may be used. For example

    void scale( double *restrict a, const double *restrict b, 
                const double sc ) {
      int i;
      for (i=0; i<10; i++) {
          a[i] = b[i] * sc;
      }
    }

With the xlc family of compilers, it may be necessary to use a pragma to achieve the same effect (it appears that restrict is not recognized, though __restrict is recognized). The line

#pragma disjoint (*a,*b)
tells the xlc compiler that pointers a and b point to different memory. This is more precise than the C restrict qualifier, but is not portable to other systems. The BG/L RedBook recommends the use of the #pragma form. In a few experiments with the xlc compiler on icrunch, the #pragma form appeared to generate better code (independent loads moved ahead of stores).
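One way to keep a single source file usable with both approaches is sketched below. It assumes that the XL C compiler predefines the macro __IBMC__ and that #pragma disjoint may be placed inside the function body once the parameters are in scope; the NOALIAS macro and the function name scale_n are hypothetical. Check the compiler listing to confirm that the pragma is being honored.

```c
/* Choose how to express non-aliasing:
   - XL C (detected via __IBMC__, an assumption): rely on #pragma disjoint
   - other C99 compilers: use the restrict qualifier
   - anything else: no annotation */
#if defined(__IBMC__)
#define NOALIAS
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#define NOALIAS restrict
#else
#define NOALIAS
#endif

/* Hypothetical variant of scale() that takes the length as an argument. */
void scale_n(double *NOALIAS a, const double *NOALIAS b,
             const double sc, int n)
{
    int i;
#if defined(__IBMC__)
#pragma disjoint (*a, *b)
#endif
    for (i = 0; i < n; i++) {
        a[i] = b[i] * sc;
    }
}
```

With either annotation, the caller remains responsible for passing arrays that do not overlap.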

Parallelism

There are two principal models for using the two processors on each node: the communication co-processor model and the virtual node model. In the communication co-processor model, the second processor is used exclusively for supporting communication. In the virtual node model, each processor supports a separate MPI process, with each processor receiving half the memory of the node.

Process to Processor Mapping

The way in which processes are assigned to physical processors can be controlled in several ways. The environment variable BGLMPI_MAPPING may be used to provide simple control of the mapping of processes, relative to their rank in MPI_COMM_WORLD, to processors and nodes. For example, when the system is running in virtual node mode, the processes can be assigned with consecutive pairs on the same node (that is, ranks 0 and 1 on the first node, ranks 2 and 3 on the second) with the mpirun option
    -env BGLMPI_MAPPING=TXYZ
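To make the effect of this mapping concrete, the sketch below computes where a rank would land. It assumes that the letters in the mapping string are ordered from fastest-varying to slowest-varying, with T the processor index within a node (0 or 1 in virtual node mode), which matches the pairing described above. The type bgl_coord, the function txyz_coord, and the torus dimensions nx and ny are all hypothetical names for illustration; none of this is an interface of the BG/L MPI library.

```c
/* Hypothetical type: torus coordinates of a process. t selects the
   processor within the node at (x, y, z). */
typedef struct { int x, y, z, t; } bgl_coord;

/* Coordinates of `rank` under a TXYZ-style mapping, assuming T varies
   fastest, then X, then Y, then Z, with 2 processors per node and a
   torus that is nx by ny nodes in its first two dimensions. */
bgl_coord txyz_coord(int rank, int nx, int ny)
{
    bgl_coord c;
    c.t = rank % 2;   rank /= 2;
    c.x = rank % nx;  rank /= nx;
    c.y = rank % ny;  rank /= ny;
    c.z = rank;
    return c;
}
```

Under these assumptions, ranks 0 and 1 share the node at (0,0,0) and ranks 2 and 3 share the node at (1,0,0).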
MCS Division Argonne National Laboratory University of Chicago