Network Fault Tolerance in LA-MPI

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2840)

Abstract

LA-MPI is a high-performance, network-fault-tolerant implementation of MPI designed for terascale clusters that are inherently unreliable due to their very large number of system components and to trade-offs between cost and performance. This paper reviews the architectural design of LA-MPI, focusing on our approach to guaranteeing data integrity. We discuss our network data path abstraction, which makes LA-MPI highly portable, delivers high performance through message striping, and, most importantly, provides the basis for network fault tolerance. Finally, we include some performance numbers for the Quadrics Elan, Myrinet GM, and UDP network data paths.
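
The abstract names message striping and data-integrity checking without giving details, so the following is only an illustrative sketch of the general idea and not LA-MPI code. It assumes a message is cut into fixed-size fragments, each fragment carries a CRC-32 of its payload, fragments are spread round-robin over the available paths, and a fragment that is lost or fails its check is retried on the next path. The names `Fragment`, `Path`, `send_striped`, the `fault` parameter, and the choice of CRC-32 are all hypothetical and chosen for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Fragment:
    seq: int        # position of this fragment within the message
    payload: bytes  # fragment data
    crc: int        # CRC-32 of the payload, computed by the sender


class Path:
    """Hypothetical network path with an optional injected fault."""

    def __init__(self, name, fault=None):
        self.name = name
        self.fault = fault  # None, "drop", or "corrupt" (assumed fault modes)

    def deliver(self, frag):
        """Return the fragment as the receiver would see it, or None if lost."""
        if self.fault == "drop":
            return None
        if self.fault == "corrupt":
            # Flip the first payload byte but leave the sender's CRC untouched.
            damaged = bytes([frag.payload[0] ^ 0xFF]) + frag.payload[1:]
            return Fragment(frag.seq, damaged, frag.crc)
        return frag


def send_striped(message, paths, frag_size=4):
    """Stripe `message` over `paths`, retrying bad fragments on other paths."""
    frags = []
    for seq, offset in enumerate(range(0, len(message), frag_size)):
        chunk = message[offset:offset + frag_size]
        frags.append(Fragment(seq, chunk, zlib.crc32(chunk)))

    received = {}
    for frag in frags:
        # Start at the round-robin path for this fragment, then fail over.
        for attempt in range(len(paths)):
            path = paths[(frag.seq + attempt) % len(paths)]
            got = path.deliver(frag)
            if got is not None and zlib.crc32(got.payload) == got.crc:
                received[got.seq] = got.payload
                break
        else:
            raise RuntimeError(f"fragment {frag.seq} failed on every path")

    # Reassemble in sequence order on the receiving side.
    return b"".join(received[i] for i in range(len(frags)))


if __name__ == "__main__":
    paths = [Path("a"), Path("b", fault="corrupt"), Path("c", fault="drop")]
    print(send_striped(b"hello, terascale world", paths))
```

In this sketch a single healthy path is enough for the whole message to arrive intact, while a run with no working path raises an error instead of returning damaged data. A real implementation would also need acknowledgements, timeouts, and out-of-order arrival handling, which are omitted here.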




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Aulwes, R.T. et al. (2003). Network Fault Tolerance in LA-MPI. In: Dongarra, J., Laforenza, D., Orlando, S. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2003. Lecture Notes in Computer Science, vol 2840. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39924-7_48

  • DOI: https://doi.org/10.1007/978-3-540-39924-7_48

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20149-6

  • Online ISBN: 978-3-540-39924-7

  • eBook Packages: Springer Book Archive
