Optimization of Collective Reduction Operations

  • Rolf Rabenseifner
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3036)


Five years of profiling in production mode at the University of Stuttgart have shown that more than 40% of the execution time of Message Passing Interface (MPI) routines is spent in the collective communication routines MPI_Allreduce and MPI_Reduce. Although MPI implementations have been available for about 10 years and all vendors are committed to the MPI standard, new algorithms can accelerate the vendors’ and publicly available reduction implementations by a factor between 3 (IBM, sum) and 100 (Cray T3E, maxloc) for long vectors. This paper presents five algorithms optimized for different choices of vector size and number of processes. The focus is on bandwidth-dominated protocols for power-of-two and non-power-of-two numbers of processes, optimizing the load balance in communication and computation.
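The abstract does not spell out the five algorithms, so the following is only an illustrative sketch of one commonly described bandwidth-oriented scheme for a power-of-two number of processes: a reduce-scatter by recursive halving followed by an allgather by recursive doubling. It is a pure in-memory simulation that makes no MPI calls. The function name `allreduce_halving_doubling` and the list-of-lists representation of per-process vectors are assumptions made for illustration, and the sketch should not be read as the paper's implementation.

```python
def allreduce_halving_doubling(vectors):
    """Simulate a sum-allreduce over p = 2^k 'processes' (hypothetical helper).

    vectors[r] is the input vector of simulated rank r. Returns the list of
    per-rank result vectors; every rank ends up with the element-wise sum.
    """
    p = len(vectors)
    n = len(vectors[0])
    assert p > 0 and p & (p - 1) == 0, "sketch assumes a power-of-two rank count"
    assert all(len(v) == n for v in vectors), "all ranks need equal-length vectors"

    bufs = [list(v) for v in vectors]
    lo = [0] * p   # start of the index range rank r is responsible for
    hi = [n] * p   # end (exclusive) of that range

    # Phase 1: reduce-scatter by recursive halving.
    # In each step a rank gives away half of its current range to its partner
    # and reduces the partner's contribution into the half it keeps.
    dist = p // 2
    while dist >= 1:
        old = [list(b) for b in bufs]  # values "sent" in this step
        new_lo, new_hi = list(lo), list(hi)
        for r in range(p):
            partner = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            if r < partner:            # lower rank keeps the lower half
                new_lo[r], new_hi[r] = lo[r], mid
            else:                      # higher rank keeps the upper half
                new_lo[r], new_hi[r] = mid, hi[r]
            for i in range(new_lo[r], new_hi[r]):
                bufs[r][i] = old[r][i] + old[partner][i]
        lo, hi = new_lo, new_hi
        dist //= 2

    # Phase 2: allgather by recursive doubling.
    # Partners swap their fully reduced ranges, doubling the range each holds.
    dist = 1
    while dist < p:
        old = [list(b) for b in bufs]
        old_lo, old_hi = list(lo), list(hi)
        for r in range(p):
            partner = r ^ dist
            plo, phi = old_lo[partner], old_hi[partner]
            bufs[r][plo:phi] = old[partner][plo:phi]
            lo[r] = min(old_lo[r], plo)
            hi[r] = max(old_hi[r], phi)
        dist *= 2

    return bufs


if __name__ == "__main__":
    inputs = [[rank + i for i in range(8)] for rank in range(4)]
    for rank, result in enumerate(allreduce_halving_doubling(inputs)):
        print(rank, result)
```

In this scheme the amount of data a rank exchanges halves with every step of the first phase, so for long vectors each rank communicates roughly two vector lengths in total, and the reduction work is spread evenly over all ranks. That is the kind of load balance in communication and computation the abstract refers to. Handling a non-power-of-two number of processes needs extra steps that this sketch deliberately omits.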


Keywords: Message Passing, MPI, Collective Operations, Reduction



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Rolf Rabenseifner, High-Performance Computing-Center (HLRS), University of Stuttgart, Stuttgart, Germany
