TH-MPI: OS Kernel Integrated Fault Tolerant MPI

  • Conference paper
Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2131)

Abstract

Because parallel cluster systems consist of large numbers of computing nodes, they run a high risk of individual node failure. To overcome the high overhead of current fault-tolerant MPI systems, this paper presents TH-MPI for parallel cluster systems. Because it is integrated into the Linux kernel, TH-MPI is implemented in a more effective, transparent, and extensive way. Our experiment shows that, with the support of dynamic kernel module and diskless checkpointing technologies, checkpointing in TH-MPI is effectively optimized.
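
The abstract names diskless checkpointing but does not say which scheme TH-MPI uses. As a rough illustration of the general idea only, the sketch below shows one common form of diskless checkpointing, in which an XOR parity of the nodes' in-memory checkpoints is kept on a spare node so that a single lost checkpoint can be rebuilt without touching disk. It is written in user-space Python rather than as kernel code; the function names and the sample node states are hypothetical, and nothing here is taken from the TH-MPI implementation.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: list[bytes]) -> bytes:
    """Compute the parity block a spare node would hold in memory."""
    return reduce(xor_bytes, checkpoints)


def recover(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


# Hypothetical in-memory checkpoint images of three compute nodes
# (assumed to be padded to the same size).
node_state = [b"rank0-state", b"rank1-state", b"rank2-state"]
parity = make_parity(node_state)

# Simulate the failure of node 1 and rebuild its checkpoint.
lost = 1
survivors = [s for i, s in enumerate(node_state) if i != lost]
restored = recover(survivors, parity)
```

This simple parity scheme tolerates only one node failure per checkpoint group; it is shown because it makes the "no disk on the critical path" idea concrete, not because the paper confirms it.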

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chen, Y., Fang, Q., Du, Z., Li, S. (2001). TH-MPI: OS Kernel Integrated Fault Tolerant MPI. In: Cotronis, Y., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2001. Lecture Notes in Computer Science, vol 2131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45417-9_15

  • DOI: https://doi.org/10.1007/3-540-45417-9_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42609-7

  • Online ISBN: 978-3-540-45417-5

  • eBook Packages: Springer Book Archive
