An Intelligent Management of Fault Tolerance in Cluster Using RADICMPI

  • Conference paper
Recent Advances in Parallel Virtual Machine and Message Passing Interface (EuroPVM/MPI 2006)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 4192)

Abstract

Independence from special elements, transparency, and scalability are key features required of fault tolerance schemes for modern computer clusters. To meet these requirements, we developed the RADIC architecture (Redundant Array of Distributed Independent Checkpoints). RADIC is based on a fully distributed array of processes that collaborate to form a distributed fault tolerance controller. This controller works without special, central, or stable elements. RADIC performs its fault tolerance activities transparently to the user application, using a message-log rollback-recovery protocol. Using the RADIC concepts, we implemented a prototype, RADICMPI, which provides some standard MPI directives and includes all the functionality of RADIC. We tested RADICMPI in a real environment by injecting failures into cluster nodes and monitoring the behavior of the application. Our tests confirmed the correct operation of RADICMPI and the effectiveness of the RADIC mechanism.

This work was supported by the MEyC-Spain under contract TIN 2004-03388.
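
The abstract names a message-log rollback-recovery protocol without giving details, so the following is only a minimal sketch of the general technique, not RADICMPI's implementation. It assumes a deterministic process whose state depends solely on the messages it receives, and it assumes receiver-side logging to storage that survives the failure. The names `RemoteStore`, `Worker`, and `demo` are hypothetical and chosen for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class RemoteStore:
    """Hypothetical stand-in for checkpoint and log storage kept on another node."""
    checkpoint: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def save_checkpoint(self, state):
        # A new checkpoint makes the older log entries unnecessary.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def log_message(self, msg):
        self.log.append(msg)


class Worker:
    """Deterministic process: its state depends only on the messages it receives."""

    def __init__(self, store):
        self.store = store
        self.state = {"total": 0, "received": 0}
        store.save_checkpoint(self.state)

    def receive(self, msg):
        # Log before applying, so the message survives a crash (assumed policy).
        self.store.log_message(msg)
        self._apply(msg)

    def _apply(self, msg):
        self.state["total"] += msg
        self.state["received"] += 1

    def checkpoint(self):
        self.store.save_checkpoint(self.state)

    @classmethod
    def recover(cls, store):
        # Roll back to the last checkpoint, then replay the logged messages in order.
        worker = cls.__new__(cls)
        worker.store = store
        worker.state = copy.deepcopy(store.checkpoint)
        for msg in list(store.log):
            worker._apply(msg)
        return worker


def demo():
    store = RemoteStore()
    worker = Worker(store)
    for msg in (1, 2, 3):
        worker.receive(msg)
    worker.checkpoint()
    for msg in (4, 5):
        worker.receive(msg)
    state_before_failure = copy.deepcopy(worker.state)
    del worker  # simulate the failure of the node running the worker
    recovered = Worker.recover(store)
    return state_before_failure, recovered.state
```

Calling `demo()` returns the state just before the simulated failure and the state rebuilt from the checkpoint plus the replayed log; under the determinism assumption the two are equal. A real system would also have to handle message ordering across several senders and failures of the storage node itself, which this sketch leaves out.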

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Duarte, A., Rexachs, D., Luque, E. (2006). An Intelligent Management of Fault Tolerance in Cluster Using RADICMPI. In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, vol 4192. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846802_26

  • DOI: https://doi.org/10.1007/11846802_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-39110-4

  • Online ISBN: 978-3-540-39112-8

  • eBook Packages: Computer Science, Computer Science (R0)
