Abstract
LA-MPI is a high-performance, network-fault-tolerant implementation of MPI designed for terascale clusters, which are inherently unreliable because of their very large number of system components and because of trade-offs between cost and performance. This paper reviews the architectural design of LA-MPI, focusing on our approach to guaranteeing data integrity. We discuss our network data path abstraction, which makes LA-MPI highly portable, delivers high performance through message striping, and, most importantly, provides the basis for network fault tolerance. Finally, we include performance numbers for the Quadrics Elan, Myrinet GM, and UDP network data paths.
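As a rough illustration of how message striping and data-integrity checking can fit together, the following minimal Python sketch splits a message into fragments, sends them round-robin over several paths, and retransmits any fragment whose checksum does not match on arrival. It is not LA-MPI's implementation: the names `make_fragments`, `send_striped`, `reliable`, and `FlakyPath` are hypothetical, CRC-32 stands in for whatever checksum a real system would use, and each "path" is simply a callable that returns the bytes that arrived.

```python
import zlib


def make_fragments(message: bytes, frag_size: int):
    """Split a message into (sequence number, payload, CRC-32) fragments."""
    fragments = []
    for seq, off in enumerate(range(0, len(message), frag_size)):
        payload = message[off:off + frag_size]
        fragments.append((seq, payload, zlib.crc32(payload)))
    return fragments


def send_striped(message: bytes, paths, frag_size: int = 4, max_retries: int = 3) -> bytes:
    """Stripe fragments round-robin over `paths` and verify each on arrival.

    Illustrative assumption: a path is a callable that takes the payload
    and returns the bytes delivered at the receiver (possibly corrupted).
    A fragment that fails its CRC check is resent on the next path.
    """
    received = {}
    for seq, payload, crc in make_fragments(message, frag_size):
        for attempt in range(max_retries + 1):
            path = paths[(seq + attempt) % len(paths)]
            arrived = path(payload)
            if zlib.crc32(arrived) == crc:
                received[seq] = arrived
                break
        else:
            raise RuntimeError(f"fragment {seq} could not be delivered intact")
    # Reassemble in sequence order, independent of which path carried what.
    return b"".join(received[seq] for seq in sorted(received))


def reliable(payload: bytes) -> bytes:
    """Hypothetical path that never corrupts data."""
    return payload


class FlakyPath:
    """Hypothetical path that flips bits in the first payload it carries."""

    def __init__(self):
        self.calls = 0

    def __call__(self, payload: bytes) -> bytes:
        self.calls += 1
        if self.calls == 1 and payload:
            return bytes([payload[0] ^ 0xFF]) + payload[1:]
        return payload
```

Calling `send_striped(b"hello, terascale cluster", [reliable, FlakyPath()])` returns the original bytes even though one fragment is corrupted in transit, because the failed fragment is resent on the other path.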
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Aulwes, R.T. et al. (2003). Network Fault Tolerance in LA-MPI. In: Dongarra, J., Laforenza, D., Orlando, S. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2003. Lecture Notes in Computer Science, vol 2840. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39924-7_48
DOI: https://doi.org/10.1007/978-3-540-39924-7_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20149-6
Online ISBN: 978-3-540-39924-7
eBook Packages: Springer Book Archive