We know what it feels like to be under pressure. Try out a few quick and proven optimization stunts described below. They may provide a good enough performance gain right away.

There are several parameters that can be adjusted with relative ease. Here are the steps we follow when hard pressed:

  • Use Intel MPI Library 1 and Intel Composer XE 2

  • Got more time? Tune Intel MPI:

    • Collect built-in statistics data

    • Tune Intel MPI process placement and pinning

    • Tune OpenMP thread pinning

  • Got still more time? Tune Intel Composer XE:

    • Analyze optimization and vectorization reports

    • Use interprocedural optimization

Using Intel MPI Library

The Intel MPI Library delivers good out-of-the-box performance for bandwidth-bound applications. If your application belongs to this popular class, you should feel the difference immediately when switching over.

If your application was built against an Intel MPI-compatible MPI implementation such as MPICH, 3 MVAPICH2, 4 or IBM POE, 5 among others, there is no need to recompile it. You can switch over by dynamically linking the Intel MPI 5.0 libraries at runtime:

$ source /opt/intel/impi_latest/bin64/mpivars.sh

$ mpirun -np 16 -ppn 2 xhpl

If you use another MPI implementation and have access to the application source code, you can rebuild your application using the Intel MPI compiler wrapper scripts; a brief example follows the list:

  • Use mpicc (for C), mpicxx (for C++), and mpifc/mpif77/mpif90 (for Fortran) if you target GNU compilers.

  • Use mpiicc, mpiicpc, and mpiifort if you target Intel Composer XE.
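For example, a minimal rebuild and run might look like this. This is only a sketch: the source files (src/*.c) and the application name (your_app) are hypothetical placeholders for your own build system.

$ source /opt/intel/impi_latest/bin64/mpivars.sh

$ mpiicc -O2 -o your_app src/*.c    # Intel Composer XE toolchain

$ mpicc -O2 -o your_app src/*.c     # or: GNU toolchain

$ mpirun -np 16 -ppn 2 ./your_app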

Using Intel Composer XE

The invocation of the Intel Composer XE is largely compatible with the widely used GNU Compiler Collection (GCC). This includes both the most commonly used command line options and the language support for C/C++ and Fortran. For many applications you can simply replace gcc with icc, g++ with icpc, and gfortran with ifort. However, be aware that although the binary code generated by Intel C/C++ Composer XE is compatible with the GCC-built executable code, the binary code generated by the Intel Fortran Composer is not.

For example:

$ source /opt/intel/composerxe/bin/compilervars.sh intel64

$ icc -O3 -xHost -qopenmp -c example.c

Revisit the compiler flags you used before the switch; you may have to remove some of them. Make sure that Intel Composer XE is invoked with the flags that give the best performance for your application (see Table 1-1). More information can be found in the Intel Composer XE documentation. 6

Table 1-1. Selected Intel Composer XE Optimization Flags

For most applications, the default optimization level of -O2 will suffice: it compiles quickly and delivers reasonable performance. If you feel adventurous, try -O3. Its optimizations are more aggressive, but it also increases the compilation time.
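If you want a quick check of what -O3 buys you over -O2, build the same code twice and compare timings. This is a sketch only; the source file and program names are hypothetical, and the actual gain is highly application dependent.

$ icc -O2 -xHost -qopenmp -o example_O2 example.c

$ icc -O3 -xHost -qopenmp -o example_O3 example.c

$ time ./example_O2    # baseline

$ time ./example_O3    # more aggressive optimization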

Tuning Intel MPI Library

If you have more time, you can try to tune Intel MPI parameters without changing the application source code.

Gather Built-in Statistics

Intel MPI comes with a built-in statistics-gathering mechanism. It adds negligible runtime overhead and reports key performance metrics (for example, the MPI-to-computation ratio, message sizes and counts, and the collective operations used) in the popular IPM format. 7

To switch the IPM statistics gathering mode on and do the measurements, enter the following commands:

$ export I_MPI_STATS=ipm

$ mpirun -np 16 xhpl

By default, this will generate a file called stats.ipm. Listing 1-1 shows an example of the MPI statistics gathered for the well-known High Performance Linpack (HPL) benchmark. 8 (We will return to this benchmark throughout this book, by the way.)

Listing 1-1. MPI Statistics for the HPL Benchmark with the Most Interesting Fields Highlighted

Intel(R) MPI Library Version 5.0
Summary MPI Statistics
Stats format: region
Stats scope : full

############################################################################
#
# command : /home/book/hpl/./xhpl_hybrid_intel64_dynamic (completed)
# host    : esg066/x86_64_Linux             mpi_tasks : 16 on 8 nodes
# start   : 02/14/14/12:43:33               wallclock : 2502.401419 sec
# stop    : 02/14/14/13:25:16               %comm     : 8.43
# gbytes  : 0.00000e+00 total               gflop/sec : NA
#
############################################################################
# region  : *   [ntasks] = 16
#
#                          [total]       <avg>         min           max
# entries                  16            1             1             1
# wallclock                40034.7       2502.17       2502.13       2502.4
# user                     446800        27925         27768.4       28192.7
# system                   1971.27       123.205       102.103       145.241
# mpi                      3375.05       210.941       132.327       282.462
# %comm                                  8.43032       5.28855       11.2888
# gflop/sec                NA            NA            NA            NA
# gbytes                   0             0             0             0
#
#
#                          [time]        [calls]       <%mpi>        <%wall>
# MPI_Send                 2737.24       1.93777e+06   81.10         6.84
# MPI_Recv                 394.827       16919         11.70         0.99
# MPI_Wait                 236.568       1.92085e+06   7.01          0.59
# MPI_Iprobe               3.2257        6.57506e+06   0.10          0.01
# MPI_Init_thread          1.55628       16            0.05          0.00
# MPI_Irecv                1.31957       1.92085e+06   0.04          0.00
# MPI_Type_commit          0.212124      14720         0.01          0.00
# MPI_Type_free            0.0963376     14720         0.00          0.00
# MPI_Comm_split           0.0065608     48            0.00          0.00
# MPI_Comm_free            0.000276804   48            0.00          0.00
# MPI_Wtime                9.67979e-05   48            0.00          0.00
# MPI_Comm_size            9.13143e-05   452           0.00          0.00
# MPI_Comm_rank            7.77245e-05   452           0.00          0.00
# MPI_Finalize             6.91414e-06   16            0.00          0.00
# MPI_TOTAL                3375.05       1.2402e+07    100.00        8.43
############################################################################

From Listing 1-1 you can deduce that MPI communication occupies between 5.3 and 11.3 percent of the total runtime, and that the MPI_Send, MPI_Recv, and MPI_Wait operations take about 81, 12, and 7 percent, respectively, of the total MPI time. With this data at hand, you can see that there are potential load imbalances between the job processes, and that you should focus on making the MPI_Send operation as fast as it can go to achieve a noticeable performance hike.

Note that if you use the full IPM package instead of the built-in statistics, you will also get data on the total communication volume and floating point performance that are not measured by the Intel MPI Library.

Optimize Process Placement

The Intel MPI Library puts adjacent MPI ranks on one cluster node as long as there are cores to occupy. Use the Intel MPI command line argument -ppn to control the process placement across the cluster nodes. For example, this command will start two processes per node:

$ mpirun -np 16 -ppn 2 xhpl

Intel MPI supports process pinning to restrict the MPI ranks to parts of the system so as to optimize process layout (for example, to avoid NUMA effects or to reduce latency to the InfiniBand adapter). Many relevant settings are described in the Intel MPI Library Reference Manual. 9

Briefly, if you want to run a pure MPI program only on the physical processor cores, enter the following commands:

$ export I_MPI_PIN_PROCESSOR_LIST=allcores

$ mpirun -np 2 your_MPI_app

If you want to run a hybrid MPI/OpenMP program, don't change the default Intel MPI settings, and see the next section for the OpenMP ones.

If you want to analyze Intel MPI process layout and pinning, set the following environment variable:

$ export I_MPI_DEBUG=4
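Putting these pieces together, a pinned run that also prints the resulting process layout might look like the following sketch (your_MPI_app is a hypothetical binary; the layout reported depends on your cluster):

$ export I_MPI_PIN_PROCESSOR_LIST=allcores

$ export I_MPI_DEBUG=4

$ mpirun -np 16 -ppn 2 ./your_MPI_app    # the debug output includes the rank-to-core map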

Optimize Thread Placement

If the application uses OpenMP for multithreading, you may want to control thread placement in addition to the process placement. Two possible strategies are:

$ export KMP_AFFINITY=granularity=thread,compact

$ export KMP_AFFINITY=granularity=thread,scatter

The first setting keeps threads close together to improve inter-thread communication, while the second setting distributes the threads across the system to maximize memory bandwidth.

Programs that use the OpenMP API version 4.0 can use the equivalent OpenMP affinity settings instead of the KMP_AFFINITY environment variable:

$ export OMP_PROC_BIND=close

$ export OMP_PROC_BIND=spread

If you use I_MPI_PIN_DOMAIN, Intel MPI will confine the OpenMP threads of an MPI process to a single socket. Then you can use the following setting to avoid thread movement between the logical cores of the socket:

$ export KMP_AFFINITY=granularity=thread
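For a complete hybrid MPI/OpenMP launch, the settings above combine roughly as follows. This is a sketch that assumes one MPI process per socket and eight OpenMP threads per process; the binary name and thread count are hypothetical, and I_MPI_PIN_DOMAIN=socket is one of the documented domain settings.

$ export I_MPI_PIN_DOMAIN=socket                   # one domain (socket) per MPI process

$ export OMP_NUM_THREADS=8                         # threads per MPI process (assumed)

$ export KMP_AFFINITY=granularity=thread,compact   # keep each process's threads close together

$ mpirun -np 16 -ppn 2 ./xhpl_hybrid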

Tuning Intel Composer XE

If you have access to the source code of the application, you can perform optimizations by selecting appropriate compiler switches and recompiling the source code.

Analyze Optimization and Vectorization Reports

Add compiler flags -qopt-report and/or -vec-report to see what the compiler did to your source code. This will report all the transformations applied to your code. It will also highlight those code patterns that prevented successful optimization. Address them if you have time left.
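Because these reports can run to thousands of lines for real applications, it often helps to write them to a file and filter for the remarks you care about. A minimal sketch, using the same flags as in Listing 1-2 but with a hypothetical report file name:

$ ifort -O3 -qopenmp -qopt-report -qopt-report-file=example.optrpt -c example.F90

$ grep "not vectorized" example.optrpt    # list the loops the compiler could not vectorize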

Listing 1-2 shows a small example; because the optimization report can be very long, only an excerpt appears here. The example code contains several loop nests that are seven loops deep. The compiler found an OpenMP directive and parallelized the loop nest accordingly. It also recognized that the loop order was not optimal and automatically permuted some loops to improve the prospects for vectorization. It then vectorized all the innermost loops while leaving the outer loops as they were.

Listing 1-2. Example Optimization Report with the Most Interesting Fields Highlighted

$ ifort -O3 -qopenmp -qopt-report -qopt-report-file=stdout -c example.F90

    Report from: Interprocedural optimizations [ipo]
[...]
OpenMP Construct at example.F90(8,7)
remark #15059: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at example.F90(25,7)
remark #15059: OpenMP DEFINED LOOP WAS PARALLELIZED
[...]
LOOP BEGIN at example.F90(9,2)
   remark #15018: loop was not vectorized: not inner loop
   LOOP BEGIN at example.F90(12,5)
      remark #25448: Loopnest Interchanged : (1 2 3 4) --> (1 4 2 3)
      remark #15018: loop was not vectorized: not inner loop
      LOOP BEGIN at example.F90(12,5)
         remark #15018: loop was not vectorized: not inner loop
[...]
                  LOOP BEGIN at example.F90(15,8)
                     remark #25446: blocked by 125   (pre-vector)
                     remark #25444: unrolled and jammed by 4   (pre-vector)
                     remark #15018: loop was not vectorized: not inner loop
                     LOOP BEGIN at example.F90(13,6)
                        remark #25446: blocked by 125   (pre-vector)
                        remark #15018: loop was not vectorized: not inner loop
                        LOOP BEGIN at example.F90(14,7)
                           remark #25446: blocked by 128   (pre-vector)
                           remark #15003: PERMUTED LOOP WAS VECTORIZED
                        LOOP END
                        LOOP BEGIN at example.F90(14,7)
                        Remainder
                           remark #25460: Loop was not optimized
                        LOOP END
                     LOOP END
                  LOOP END
[...]
            LOOP END
         LOOP END
      LOOP END
   LOOP END
LOOP END

LOOP BEGIN at example.F90(26,2)
   remark #15018: loop was not vectorized: not inner loop
   LOOP BEGIN at example.F90(29,5)
      remark #25448: Loopnest Interchanged : (1 2 3 4) --> (1 3 2 4)
      remark #15018: loop was not vectorized: not inner loop
      LOOP BEGIN at example.F90(29,5)
         remark #15018: loop was not vectorized: not inner loop
         LOOP BEGIN at example.F90(29,5)
            remark #15018: loop was not vectorized: not inner loop
            LOOP BEGIN at example.F90(29,5)
               remark #15018: loop was not vectorized: not inner loop
               LOOP BEGIN at example.F90(29,5)
                  remark #25446: blocked by 125   (pre-vector)
                  remark #25444: unrolled and jammed by 4   (pre-vector)
                  remark #15018: loop was not vectorized: not inner loop
[...]
               LOOP END
            LOOP END
         LOOP END
      LOOP END
   LOOP END
LOOP END

Listing 1-3 shows the vectorization report for the example in Listing 1-2. As you can see, the vectorization report contains the same information about vectorization as the optimization report.

Listing 1-3. Example Vectorization Report with the Most Interesting Fields Highlighted

$ ifort -O3 -qopenmp -vec-report=2 -qopt-report-file=stdout -c example.F90
[...]
LOOP BEGIN at example.F90(9,2)
   remark #15018: loop was not vectorized: not inner loop
   LOOP BEGIN at example.F90(12,5)
      remark #15018: loop was not vectorized: not inner loop
      LOOP BEGIN at example.F90(12,5)
         remark #15018: loop was not vectorized: not inner loop
         LOOP BEGIN at example.F90(12,5)
            remark #15018: loop was not vectorized: not inner loop
            LOOP BEGIN at example.F90(12,5)
               remark #15018: loop was not vectorized: not inner loop
               LOOP BEGIN at example.F90(12,5)
                  remark #15018: loop was not vectorized: not inner loop
                  LOOP BEGIN at example.F90(15,8)
                     remark #15018: loop was not vectorized: not inner loop
                     LOOP BEGIN at example.F90(13,6)
                        remark #15018: loop was not vectorized: not inner loop
                        LOOP BEGIN at example.F90(14,7)
                           remark #15003: PERMUTED LOOP WAS VECTORIZED
                        LOOP END
[...]
                     LOOP END
                  LOOP END
                  LOOP BEGIN at example.F90(15,8)
                  Remainder
                     remark #15018: loop was not vectorized: not inner loop
                     LOOP BEGIN at example.F90(13,6)
                        remark #15018: loop was not vectorized: not inner loop
[...]
                        LOOP BEGIN at example.F90(14,7)
                           remark #15003: PERMUTED LOOP WAS VECTORIZED
                        LOOP END
[...]
                     LOOP END
                  LOOP END
               LOOP END
[...]
            LOOP END
         LOOP END
      LOOP END
   LOOP END
LOOP END
[...]

Use Interprocedural Optimization

Add the compiler flag -ipo to switch on interprocedural optimization. This will give the compiler a holistic view of the program and open more optimization opportunities for the program as a whole. Note that this will also increase the overall compilation time.
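A minimal sketch of an IPO build follows; the source and program names are hypothetical, and note that -ipo should be passed both when compiling and when linking:

$ icc -O3 -xHost -ipo -c main.c solver.c             # compile with IPO information

$ icc -O3 -xHost -ipo -o your_app main.o solver.o    # link step performs the cross-file optimization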

Runtime profiling can further help the compiler generate better code. Profile-guided optimization is a three-step process. First, compile the application with the -prof-gen flag to instrument it with profiling code. Second, run the instrumented application on a typical dataset to produce a meaningful profile. Third, recompile with the -prof-use flag so that the compiler can optimize the code based on the collected profile.
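The corresponding command sequence might look like this sketch (file and program names are hypothetical; the instrumented run writes its profile data into the current directory):

$ icc -O2 -prof-gen -o your_app_inst your_app.c    # step 1: instrumented build

$ ./your_app_inst typical_input.dat                # step 2: run on a typical dataset

$ icc -O2 -prof-use -o your_app your_app.c         # step 3: rebuild using the profile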

Summary

Switching to Intel MPI and Intel Composer XE can help improve performance because the two strive to optimally support Intel platforms and deliver good out-of-the-box (OOB) performance. Tuning measures can further improve the situation. The next chapters will reiterate the quick and dirty examples of this chapter and show you how to push the limits.

References

  1. Intel Corporation, "Intel(R) MPI Library," http://software.intel.com/en-us/intel-mpi-library.

  2. Intel Corporation, "Intel(R) Composer XE Suites," http://software.intel.com/en-us/intel-composer-xe.

  3. Argonne National Laboratory, "MPICH: High-Performance Portable MPI," www.mpich.org.

  4. Ohio State University, "MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE," http://mvapich.cse.ohio-state.edu/overview/mvapich2/.

  5. International Business Machines Corporation, "IBM Parallel Environment," www-03.ibm.com/systems/software/parallel/.

  6. Intel Corporation, "Intel Fortran Composer XE 2013 - Documentation," http://software.intel.com/articles/intel-fortran-composer-xe-documentation/.

  7. The IPM Developers, "Integrated Performance Monitoring - IPM," http://ipm-hpc.sourceforge.net/.

  8. A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, "HPL: A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers," 10 September 2008, www.netlib.org/benchmark/hpl/.

  9. Intel Corporation, "Intel MPI Library Reference Manual," http://software.intel.com/en-us/node/500285.