An Improved Algorithm for (Non-commutative) Reduce-Scatter with an Application

  • Jesper Larsson Träff
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3666)


The collective reduce-scatter operation in MPI performs an element-wise reduction of a sequence of m-element vectors using a given associative (and possibly commutative) binary operation, and distributes the result in blocks of \(m_i\) elements over the participating processors. For the case where the number of processors is a power of two, the binary operation is commutative, and all resulting blocks have the same size, efficient butterfly-like algorithms are well known and implemented in good MPI libraries.
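The semantics described above can be illustrated with a small sequential simulation. This is a minimal sketch, not the algorithm of the paper: `reduce_scatter_reference` and `butterfly_reduce_scatter` are hypothetical names, processors are modelled as list indices, and the butterfly variant is the textbook recursive-halving scheme, assuming a power-of-two number of processors, a commutative operation, and equal block sizes.

```python
from functools import reduce
from itertools import accumulate


def reduce_scatter_reference(vectors, counts, op):
    """Reference semantics: vectors[r] is the m-element input of rank r,
    counts[i] is the block size m_i that rank i receives."""
    m = sum(counts)
    assert all(len(v) == m for v in vectors)
    # Element-wise reduction in rank order, so the result is also
    # well defined for a non-commutative (but associative) op.
    reduced = [reduce(op, (v[j] for v in vectors)) for j in range(m)]
    offsets = [0, *accumulate(counts)]
    return [reduced[offsets[i]:offsets[i + 1]] for i in range(len(counts))]


def butterfly_reduce_scatter(vectors, block, op):
    """Textbook recursive halving, simulated round by round.
    Assumes p is a power of two, op is commutative, all blocks have
    `block` elements. Takes log2(p) rounds."""
    p = len(vectors)
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    assert all(len(v) == p * block for v in vectors)
    work = [list(v) for v in vectors]
    lo, hi = [0] * p, [p] * p          # active block range per rank
    d = p // 2
    while d >= 1:
        updates = []
        for r in range(p):
            partner = r ^ d
            mid = (lo[r] + hi[r]) // 2
            # Keep the half containing our own block; the partner
            # supplies its partial result for that same half.
            keep = (lo[r], mid) if r & d == 0 else (mid, hi[r])
            a, b = keep[0] * block, keep[1] * block
            combined = [op(x, y) for x, y in zip(work[r][a:b], work[partner][a:b])]
            updates.append((keep, combined))
        # Apply all exchanges of this round at once.
        for r, (keep, combined) in enumerate(updates):
            lo[r], hi[r] = keep
            work[r][keep[0] * block:keep[1] * block] = combined
        d //= 2
    return [work[r][r * block:(r + 1) * block] for r in range(p)]
```

In the butterfly sketch the partial results are combined in an order that depends on the rank, which is why it is only correct for commutative operations; the reference function always reduces in rank order.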

The contributions of this paper are threefold. First, we give a simple trick for extending the butterfly algorithm to the case of non-commutative operations (which is advantageous also for the commutative case). Second, combining this with previous work, we give improved algorithms for the case where the number of processors is not a power of two. Third, we extend the algorithms to the irregular case, where the sizes of the resulting blocks may differ greatly.

For p processors the algorithm requires \(\lceil \log_2 p \rceil + (\lceil \log_2 p \rceil - \lfloor \log_2 p \rfloor)\) communication rounds for the regular case, which may double for the irregular case (depending on the amount of irregularity). For vectors of size m with \(m = \sum^{p-1}_{i=0} m_i\) the total running time is \(O(\log p + m)\), irrespective of whether the block sizes \(m_i\) are equal or not. The algorithm has been implemented, and on a small Myrinet cluster gives substantial improvements (up to a factor of 3 in the experiments reported) over other often used implementations. The reduce-scatter operation is a building block in the fence one-sided communication synchronization primitive, and for this application we also document worthwhile improvements over a previous implementation.
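The round count for the regular case can be evaluated directly from the formula. The following is a small sketch; the function name `regular_rounds` is hypothetical, and integer bit lengths are used in place of floating-point logarithms to avoid rounding problems.

```python
def regular_rounds(p: int) -> int:
    """Rounds for the regular case:
    ceil(log2 p) + (ceil(log2 p) - floor(log2 p))."""
    if p < 1:
        raise ValueError("p must be a positive number of processors")
    floor_log = p.bit_length() - 1      # floor(log2 p)
    ceil_log = (p - 1).bit_length()     # ceil(log2 p)
    return ceil_log + (ceil_log - floor_log)


for p in (2, 5, 6, 8, 9, 16):
    print(p, regular_rounds(p))
```

The bracketed term is 0 when p is a power of two and 1 otherwise, so a non-power-of-two processor count costs exactly one round more than \(\lceil \log_2 p \rceil\).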







Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Jesper Larsson Träff, C&C Research Laboratories, NEC Europe Ltd, Sankt Augustin, Germany
