FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study

  • Conference paper
Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI 2006)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 4192)

Abstract

There is growing interest in deploying MPI over very large numbers of heterogeneous, geographically distributed resources. FT-MPI provides the fault tolerance necessary at this scale, but presents some issues when crossing multiple administrative domains. Using the H2O metacomputing framework, we add cross-administrative-domain interoperability and “pluggability” to FT-MPI. The latter feature allows us, using proxies, to transparently replace one vulnerable module, its name service, with fault-tolerant replacements. We present an algorithm for improving the performance of operations over the proxies, and we evaluate it in a comparison of the original name service, OpenLDAP, and HDNS, a current Emory research project.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dewolfs, D., Broeckhove, J., Sunderam, V., Fagg, G.E. (2006). FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study. In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, vol 4192. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846802_24

  • DOI: https://doi.org/10.1007/11846802_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-39110-4

  • Online ISBN: 978-3-540-39112-8

  • eBook Packages: Computer Science, Computer Science (R0)
