Abstract
We know what it feels like to be under pressure. Try out a few quick and proven optimization stunts described below. They may provide a good enough performance gain right away.
There are several parameters that can be adjusted with relative ease. Here are the steps we follow when hard pressed:
- Got more time? Tune Intel MPI:
  - Collect built-in statistics data
  - Tune Intel MPI process placement and pinning
  - Tune OpenMP thread pinning
- Got still more time? Tune Intel Composer XE:
  - Analyze optimization and vectorization reports
  - Use interprocedural optimization
Using Intel MPI Library
The Intel MPI Library delivers good out-of-the-box performance for bandwidth-bound applications. If your application belongs to this popular class, you should feel the difference immediately when switching over.
If your application has been built for an Intel MPI compatible distribution such as MPICH [3], MVAPICH2 [4], or IBM POE [5], there is no need to recompile the application. You can switch by dynamically linking the Intel MPI 5.0 libraries at runtime:
$ source /opt/intel/impi_latest/bin64/mpivars.sh
$ mpirun -np 16 -ppn 2 xhpl
If you use another MPI and have access to the application source code, you can rebuild your application using Intel MPI compiler scripts:
- Use mpicc (for C), mpicxx (for C++), and mpifc/mpif77/mpif90 (for Fortran) if you target GNU compilers.
- Use mpiicc, mpiicpc, and mpiifort if you target Intel Composer XE.
Using Intel Composer XE
The invocation of Intel Composer XE is largely compatible with the widely used GNU Compiler Collection (GCC), both in the most commonly used command line options and in the language support for C/C++ and Fortran. For many applications you can simply replace gcc with icc, g++ with icpc, and gfortran with ifort. Be aware, however, that while object code generated by the Intel C/C++ compiler is compatible with GCC-built code, object code generated by the Intel Fortran compiler is not compatible with gfortran-built code.
For example:
$ source /opt/intel/composerxe/bin/compilervars.sh intel64
$ icc -O3 -xHost -qopenmp -c example.c
Revisit the compiler flags you used before the switch; you may have to remove some of them. Make sure that Intel Composer XE is invoked with the flags that give the best performance for your application (see Table 1-1). More information can be found in the Intel Composer XE documentation [6].
For most applications, the default optimization level of -O2 will suffice. It runs fast and gives reasonable performance. If you feel adventurous, try -O3. It is more aggressive but it also increases the compilation time.
Tuning Intel MPI Library
If you have more time, you can try to tune Intel MPI parameters without changing the application source code.
Gather Built-in Statistics
Intel MPI comes with a built-in statistics-gathering mechanism. It creates a negligible runtime overhead and reports key performance metrics (for example, MPI to computation ratio, message sizes, counts, and collective operations used) in the popular IPM format [7].
To switch the IPM statistics gathering mode on and do the measurements, enter the following commands:
$ export I_MPI_STATS=ipm
$ mpirun -np 16 xhpl
By default, this will generate a file called stats.ipm. Listing 1-1 shows an example of the MPI statistics gathered for the well-known High Performance Linpack (HPL) benchmark [8]. (We will return to this benchmark throughout this book, by the way.)
Listing 1-1. MPI Statistics for the HPL Benchmark with the Most Interesting Fields Highlighted
Intel(R) MPI Library Version 5.0
Summary MPI Statistics
Stats format: region
Stats scope : full
############################################################################
#
# command : /home/book/hpl/./xhpl_hybrid_intel64_dynamic (completed)
# host    : esg066/x86_64_Linux             mpi_tasks : 16 on 8 nodes
# start   : 02/14/14/12:43:33               wallclock : 2502.401419 sec
# stop    : 02/14/14/13:25:16               %comm     : 8.43
# gbytes  : 0.00000e+00 total               gflop/sec : NA
#
############################################################################
# region  : *   [ntasks] = 16
#
#                         [total]       <avg>         min           max
# entries                 16            1             1             1
# wallclock               40034.7       2502.17       2502.13       2502.4
# user                    446800        27925         27768.4       28192.7
# system                  1971.27       123.205       102.103       145.241
# mpi                     3375.05       210.941       132.327       282.462
# %comm                                 8.43032       5.28855       11.2888
# gflop/sec               NA            NA            NA            NA
# gbytes                  0             0             0             0
#
#
#                         [time]        [calls]       <%mpi>        <%wall>
# MPI_Send                2737.24       1.93777e+06   81.10         6.84
# MPI_Recv                394.827       16919         11.70         0.99
# MPI_Wait                236.568       1.92085e+06   7.01          0.59
# MPI_Iprobe              3.2257        6.57506e+06   0.10          0.01
# MPI_Init_thread         1.55628       16            0.05          0.00
# MPI_Irecv               1.31957       1.92085e+06   0.04          0.00
# MPI_Type_commit         0.212124      14720         0.01          0.00
# MPI_Type_free           0.0963376     14720         0.00          0.00
# MPI_Comm_split          0.0065608     48            0.00          0.00
# MPI_Comm_free           0.000276804   48            0.00          0.00
# MPI_Wtime               9.67979e-05   48            0.00          0.00
# MPI_Comm_size           9.13143e-05   452           0.00          0.00
# MPI_Comm_rank           7.77245e-05   452           0.00          0.00
# MPI_Finalize            6.91414e-06   16            0.00          0.00
# MPI_TOTAL               3375.05       1.2402e+07    100.00        8.43
############################################################################
From Listing 1-1 you can deduce that MPI communication occupies between 5.3 and 11.3 percent of the total runtime, and that MPI_Send, MPI_Recv, and MPI_Wait take about 81, 12, and 7 percent of the total MPI time, respectively. The spread between the minimum and maximum communication shares hints at a load imbalance between the job processes, and the dominance of MPI_Send tells you that making this operation as fast as it can go is the most promising route to a noticeable performance hike.
Note that if you use the full IPM package instead of the built-in statistics, you will also get data on the total communication volume and floating point performance that are not measured by the Intel MPI Library.
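Because stats.ipm is plain text, a couple of awk one-liners are enough for a first triage. The snippet below is a sketch: the sample file reproduces a few lines from Listing 1-1, and the field positions assume the layout shown there; point the same commands at your real stats.ipm.

```shell
# Sample of an IPM-style stats file (lines copied from Listing 1-1);
# with a real run you would read stats.ipm instead.
cat > stats.sample <<'EOF'
# mpi                     3375.05       210.941       132.327       282.462
# %comm                                 8.43032       5.28855       11.2888
# MPI_Send                2737.24       1.93777e+06   81.10         6.84
# MPI_Recv                394.827       16919         11.70         0.99
EOF

# Average share of the runtime spent in MPI, in percent:
awk '/^# %comm/ { print $3 }' stats.sample

# MPI calls sorted by their share of the total MPI time (<%mpi> column):
awk '/^# MPI_[A-Z]/ { print $5, $2 }' stats.sample | sort -rn
```

On the Listing 1-1 data this prints 8.43032 and puts MPI_Send at the top of the list, which is exactly where the tuning effort should go first.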
Optimize Process Placement
The Intel MPI Library puts adjacent MPI ranks on one cluster node as long as there are cores to occupy. Use the Intel MPI command line argument -ppn to control the process placement across the cluster nodes. For example, this command will start two processes per node:
$ mpirun -np 16 -ppn 2 xhpl
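The resulting layout is easy to model: given enough nodes, consecutive ranks fill each node in blocks of ppn. The loop below is only an illustrative sketch of this block placement, not Intel MPI itself:

```shell
# Model of -np 16 -ppn 2: rank r lands on node r/ppn, so ranks 0 and 1
# share node 0, ranks 2 and 3 share node 1, and so on.
np=16
ppn=2
for r in $(seq 0 $((np - 1))); do
    echo "rank $r -> node $((r / ppn))"
done
```

Sixteen ranks at two per node thus occupy eight nodes, which matches the "mpi_tasks : 16 on 8 nodes" line in Listing 1-1.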
Intel MPI supports process pinning to restrict the MPI ranks to parts of the system so as to optimize process layout (for example, to avoid NUMA effects or to reduce latency to the InfiniBand adapter). Many relevant settings are described in the Intel MPI Library Reference Manual [9].
Briefly, if you want to run a pure MPI program only on the physical processor cores, enter the following commands:
$ export I_MPI_PIN_PROCESSOR_LIST=allcores
$ mpirun -np 2 your_MPI_app
If you want to run a hybrid MPI/OpenMP program, don't change the default Intel MPI settings, and see the next section for the OpenMP ones.
If you want to analyze Intel MPI process layout and pinning, set the following environment variable:
$ export I_MPI_DEBUG=4
Optimize Thread Placement
If the application uses OpenMP for multithreading, you may want to control thread placement in addition to the process placement. Two possible strategies are:
$ export KMP_AFFINITY=granularity=thread,compact
$ export KMP_AFFINITY=granularity=thread,scatter
The first setting keeps threads close together to improve inter-thread communication, while the second setting distributes the threads across the system to maximize memory bandwidth.
Programs that use the OpenMP API version 4.0 can use the equivalent OpenMP affinity settings instead of the KMP_AFFINITY environment variable:
$ export OMP_PROC_BIND=close
$ export OMP_PROC_BIND=spread
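To see what the two policies (compact/close versus scatter/spread) mean in practice, here is an illustrative model, not the actual runtime logic, for a hypothetical node with two sockets of four cores each, where cores 0-3 sit on socket 0 and cores 4-7 on socket 1:

```shell
# compact/close packs threads onto consecutive cores of one socket;
# scatter/spread alternates sockets to spread the threads out.
sockets=2
cores_per_socket=4
nthreads=4

echo "compact:"
for t in $(seq 0 $((nthreads - 1))); do
    echo "  thread $t -> core $t"          # socket 0 fills up first
done

echo "scatter:"
for t in $(seq 0 $((nthreads - 1))); do
    core=$(( (t % sockets) * cores_per_socket + t / sockets ))
    echo "  thread $t -> core $core"       # sockets 0, 1, 0, 1, ...
done
```

With four threads, compact uses cores 0-3 (all of socket 0, good for shared caches and inter-thread communication), while scatter uses cores 0, 4, 1, 5 (both sockets, good for aggregate memory bandwidth).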
If you set I_MPI_PIN_DOMAIN (for example, to socket), Intel MPI will confine the OpenMP threads of each MPI process to a single socket. Then you can use the following setting to avoid thread movement between the logical cores of the socket:
$ export KMP_AFFINITY=granularity=thread
Tuning Intel Composer XE
If you have access to the source code of the application, you can perform optimizations by selecting appropriate compiler switches and recompiling the source code.
Analyze Optimization and Vectorization Reports
Add the compiler flags -qopt-report and/or -vec-report to see what the compiler did to your source code. The reports list the transformations applied and highlight the code patterns that prevented successful optimization. Address those if you have time left.
Here is a small example. Because the optimization report may be very long, Listing 1-2 shows only an excerpt. The example code contains two loop nests of seven loops each. The compiler found the OpenMP directives and parallelized the corresponding loops. It also recognized that the loop order was not optimal and automatically permuted some loops to improve the situation for vectorization. Then it vectorized all innermost loops while leaving the outermost loops as they are.
Listing 1-2. Example Optimization Report with the Most Interesting Fields Highlighted
$ ifort -O3 -qopenmp -qopt-report -qopt-report-file=stdout -c example.F90
    Report from: Interprocedural optimizations [ipo]
[...]
OpenMP Construct at example.F90(8,7)
remark #15059: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at example.F90(25,7)
remark #15059: OpenMP DEFINED LOOP WAS PARALLELIZED
[...]
LOOP BEGIN at example.F90(9,2)
   remark #15018: loop was not vectorized: not inner loop
   LOOP BEGIN at example.F90(12,5)
      remark #25448: Loopnest Interchanged : (1 2 3 4) --> (1 4 2 3)
      remark #15018: loop was not vectorized: not inner loop
      LOOP BEGIN at example.F90(12,5)
         remark #15018: loop was not vectorized: not inner loop
[...]
                  LOOP BEGIN at example.F90(15,8)
                     remark #25446: blocked by 125   (pre-vector)
                     remark #25444: unrolled and jammed by 4   (pre-vector)
                     remark #15018: loop was not vectorized: not inner loop
                     LOOP BEGIN at example.F90(13,6)
                        remark #25446: blocked by 125   (pre-vector)
                        remark #15018: loop was not vectorized: not inner loop
                        LOOP BEGIN at example.F90(14,7)
                           remark #25446: blocked by 128   (pre-vector)
                           remark #15003: PERMUTED LOOP WAS VECTORIZED
                        LOOP END
                        LOOP BEGIN at example.F90(14,7)
                        Remainder
                           remark #25460: Loop was not optimized
                        LOOP END
                     LOOP END
                  LOOP END
[...]
            LOOP END
         LOOP END
      LOOP END
   LOOP END
LOOP END
LOOP BEGIN at example.F90(26,2)
   remark #15018: loop was not vectorized: not inner loop
   LOOP BEGIN at example.F90(29,5)
      remark #25448: Loopnest Interchanged : (1 2 3 4) --> (1 3 2 4)
      remark #15018: loop was not vectorized: not inner loop
      LOOP BEGIN at example.F90(29,5)
         remark #15018: loop was not vectorized: not inner loop
         LOOP BEGIN at example.F90(29,5)
            remark #15018: loop was not vectorized: not inner loop
            LOOP BEGIN at example.F90(29,5)
               remark #15018: loop was not vectorized: not inner loop
               LOOP BEGIN at example.F90(29,5)
                  remark #25446: blocked by 125   (pre-vector)
                  remark #25444: unrolled and jammed by 4   (pre-vector)
                  remark #15018: loop was not vectorized: not inner loop
[...]
               LOOP END
            LOOP END
         LOOP END
      LOOP END
   LOOP END
LOOP END
Listing 1-3 shows the vectorization report for the example in Listing 1-2. As you can see, the vectorization report contains the same information about vectorization as the optimization report.
Listing 1-3. Example Vectorization Report with the Most Interesting Fields Highlighted
$ ifort -O3 -qopenmp -vec-report=2 -qopt-report-file=stdout -c example.F90
[...]
LOOP BEGIN at example.F90(9,2)
   remark #15018: loop was not vectorized: not inner loop
   LOOP BEGIN at example.F90(12,5)
      remark #15018: loop was not vectorized: not inner loop
      LOOP BEGIN at example.F90(12,5)
         remark #15018: loop was not vectorized: not inner loop
         LOOP BEGIN at example.F90(12,5)
            remark #15018: loop was not vectorized: not inner loop
            LOOP BEGIN at example.F90(12,5)
               remark #15018: loop was not vectorized: not inner loop
               LOOP BEGIN at example.F90(12,5)
                  remark #15018: loop was not vectorized: not inner loop
                  LOOP BEGIN at example.F90(15,8)
                     remark #15018: loop was not vectorized: not inner loop
                     LOOP BEGIN at example.F90(13,6)
                        remark #15018: loop was not vectorized: not inner loop
                        LOOP BEGIN at example.F90(14,7)
                           remark #15003: PERMUTED LOOP WAS VECTORIZED
                        LOOP END
[...]
                     LOOP END
                  LOOP END
                  LOOP BEGIN at example.F90(15,8)
                  Remainder
                     remark #15018: loop was not vectorized: not inner loop
                     LOOP BEGIN at example.F90(13,6)
                        remark #15018: loop was not vectorized: not inner loop
[...]
                        LOOP BEGIN at example.F90(14,7)
                           remark #15003: PERMUTED LOOP WAS VECTORIZED
                        LOOP END
[...]
                     LOOP END
                  LOOP END
               LOOP END
[...]
            LOOP END
         LOOP END
      LOOP END
   LOOP END
LOOP END
[...]
Use Interprocedural Optimization
Add the compiler flag -ipo to switch on interprocedural optimization. This will give the compiler a holistic view of the program and open more optimization opportunities for the program as a whole. Note that this will also increase the overall compilation time.
Runtime profiling can also increase the chances for the compiler to generate better code. Profile-guided optimization requires a three-stage process. First, compile the application with the compiler flag -prof-gen to instrument the application with profiling code. Second, run the instrumented application with a typical dataset to produce a meaningful profile. Third, feed the compiler with the profile (-prof-use) and let it optimize the code.
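The three stages could look as follows. This is a command sketch with hypothetical file names; it assumes Intel Composer XE is on the PATH and a representative input file exists:

```shell
# Stage 1: instrument the application for profiling.
icc -O3 -prof-gen -o app.instrumented example.c
# Stage 2: run it on a typical dataset to produce the profile (.dyn files).
./app.instrumented typical_input.dat
# Stage 3: recompile, letting the compiler use the collected profile.
icc -O3 -prof-use -o app example.c
```

The quality of the final binary depends on how representative the stage-2 dataset is, so pick an input that exercises the hot paths of your production runs.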
Summary
Switching to Intel MPI and Intel Composer XE can help improve performance because both strive to optimally support Intel platforms and deliver good out-of-the-box (OOB) performance. Tuning measures can further improve the situation. The next chapters will revisit the quick-and-dirty recipes of this chapter and show you how to push the limits.
References
1. Intel Corporation, "Intel(R) MPI Library," http://software.intel.com/en-us/intel-mpi-library.
2. Intel Corporation, "Intel(R) Composer XE Suites," http://software.intel.com/en-us/intel-composer-xe.
3. Argonne National Laboratory, "MPICH: High-Performance Portable MPI," www.mpich.org.
4. Ohio State University, "MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE," http://mvapich.cse.ohio-state.edu/overview/mvapich2/.
5. International Business Machines Corporation, "IBM Parallel Environment," www-03.ibm.com/systems/software/parallel/.
6. Intel Corporation, "Intel Fortran Composer XE 2013 - Documentation," http://software.intel.com/articles/intel-fortran-composer-xe-documentation/.
7. The IPM Developers, "Integrated Performance Monitoring - IPM," http://ipm-hpc.sourceforge.net/.
8. A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, "HPL: A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers," 10 September 2008, www.netlib.org/benchmark/hpl/.
9. Intel Corporation, "Intel MPI Library Reference Manual," http://software.intel.com/en-us/node/500285.
© 2014 Alexander Supalov

Supalov, A., Semin, A., Klemm, M., Dahnken, C. (2014). "No Time to Read This Book?" In: Optimizing HPC Applications with Intel® Cluster Tools. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4302-6497-2_1