Abstract
We describe and experimentally compare four theoretically well-known algorithms for the parallel prefix operation (scan, in MPI terms), and give a presumably novel, doubly-pipelined implementation of the in-order binary tree parallel prefix algorithm. Bidirectional interconnects can benefit from this implementation. We present results from a 32-node AMD cluster with Myrinet 2000 and a 72-node SX-8 parallel vector system. The doubly-pipelined algorithm is more than a factor of two faster than the straightforward binomial-tree algorithm found in many MPI implementations. However, due to its small constant factors, the simple linear pipeline algorithm is preferable for systems with a moderate number of processors. We also discuss adapting the algorithms to clusters of SMP nodes.
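To make the operation concrete, the following is a minimal sketch of an inclusive scan, written as a sequential Python simulation rather than as MPI code. It is not the authors' implementation. The function names `linear_scan` and `doubling_scan` are hypothetical. The first function mimics a chain in which each rank forwards its prefix to the next rank (without the block-wise pipelining the abstract refers to). The second uses a round-based distance-doubling scheme in the style of Hillis and Steele, which is assumed here only as a stand-in for a logarithmic-round algorithm and may differ from the binomial-tree variant evaluated in the paper.

```python
from operator import add


def linear_scan(values, op=add):
    """Inclusive scan along a chain of ranks (simulation, no pipelining).

    Rank i receives the prefix over ranks 0..i-1 from rank i-1,
    combines it with its own value, and forwards the result to rank i+1.
    """
    result = []
    incoming = None  # the "message" travelling down the chain
    for own in values:
        prefix = own if incoming is None else op(incoming, own)
        result.append(prefix)
        incoming = prefix  # forwarded to the next rank
    return result


def doubling_scan(values, op=add):
    """Inclusive scan in ceil(log2 p) rounds (simulation).

    In the round with distance d, rank i combines the partial result
    of rank i-d (if that rank exists) with its own partial result.
    """
    p = len(values)
    partial = list(values)
    d = 1
    while d < p:
        # All ranks exchange data simultaneously, so read from a snapshot.
        snapshot = list(partial)
        for i in range(d, p):
            # The left operand covers lower ranks, so the order is
            # preserved even when op is not commutative.
            partial[i] = op(snapshot[i - d], snapshot[i])
        d *= 2
    return partial
```

Both functions take one value per simulated rank and an associative operator. For example, `linear_scan([1, 2, 3, 4])` and `doubling_scan([1, 2, 3, 4])` both yield the running sums of the input. The chain needs p - 1 sequential communication steps but little work per step, while the doubling scheme needs only a logarithmic number of rounds, which is the kind of trade-off the abstract's comparison is about.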
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sanders, P., Träff, J.L. (2006). Parallel Prefix (Scan) Algorithms for MPI. In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, vol 4192. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846802_15
DOI: https://doi.org/10.1007/11846802_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39110-4
Online ISBN: 978-3-540-39112-8
eBook Packages: Computer Science, Computer Science (R0)