An Intelligent Management of Fault Tolerance in Cluster Using RADICMPI

Duarte, Angelo; Rexachs, Dolores; Luque, Emilio

doi:10.1007/11846802_26

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 4192)

Included in the following conference series:

European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting

Abstract

Independence from special elements, transparency, and scalability are key features required of fault tolerance schemes for modern computer clusters. To meet these requirements we developed the RADIC architecture (Redundant Array of Distributed Independent Checkpoints). RADIC is based on a fully distributed array of processes that collaborate to form a distributed fault tolerance controller. This controller works without special, central, or stable elements. RADIC carries out the fault tolerance activities transparently to the user application, using a message-log rollback-recovery protocol. Using the RADIC concepts we implemented a prototype, RADICMPI, which provides some standard MPI directives and includes all the functionality of RADIC. We tested RADICMPI in a real environment by injecting failures into cluster nodes and monitoring the behavior of the application. Our tests confirmed the correct operation of RADICMPI and the effectiveness of the RADIC mechanism.

This work was supported by the MEyC-Spain under contract TIN 2004-03388.
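
The general idea behind a message-log rollback-recovery protocol, as named in the abstract, can be illustrated with a toy model. The sketch below is not RADICMPI code and does not reproduce the RADIC design; it assumes a simple receiver-based scheme in which the checkpoint and the message log live on another node. All names (`RemoteStore`, `Process`, `recover`, `demo`) are hypothetical, and the application state is reduced to a running sum of received integers.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteStore:
    """Stands in for checkpoint and log storage on a different node (assumption)."""
    checkpoint: int = 0
    log: list = field(default_factory=list)


class Process:
    """Toy process whose state is the running sum of the messages it receives."""

    def __init__(self, store, state=0):
        self.store = store
        self.state = state

    def receive(self, msg):
        self.store.log.append(msg)   # log first, so the message survives a crash
        self.state += msg            # then deliver it to the application

    def take_checkpoint(self):
        self.store.checkpoint = self.state
        self.store.log.clear()       # earlier messages are covered by the checkpoint


def recover(store):
    """Rebuild a failed process: roll back to the checkpoint, then replay the log."""
    proc = Process(store, state=store.checkpoint)
    for msg in list(store.log):
        proc.state += msg            # replay in order, without logging again
    return proc


def demo():
    store = RemoteStore()
    p = Process(store)
    p.receive(5)
    p.receive(7)
    p.take_checkpoint()              # state 12 saved, log emptied
    p.receive(3)
    p.receive(4)
    state_before_failure = p.state
    del p                            # simulated node failure
    q = recover(store)
    return state_before_failure, q.state
```

Calling `demo()` returns the state just before the simulated failure and the state of the recovered process; the two match because every message received after the checkpoint was logged before delivery. A real protocol must also handle message ordering between several senders and failures of the storage node itself, which this sketch ignores.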

Author information

Authors and Affiliations

Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, ETSE, QC/3088, Bellaterra, 08193, Barcelona, Spain
Angelo Duarte, Dolores Rexachs & Emilio Luque

Editor information

Editors and Affiliations

Forschungszentrum Jülich, ZAM, 52425, Jülich, Germany
Bernd Mohr
NEC Europe Ltd., NEC Laboratories Europe, Rathausallee 10, D-53757, Sankt Augustin, Germany
Jesper Larsson Träff
Dolphin Interconnect Solutions ASA R&D Germany, Siebengebirgsblick 26, 53343, Wachtberg, Germany
Joachim Worringen
Computer Science Department, University of Tennessee, 37996-3450, Knoxville, TN, USA
Jack Dongarra

Cite this paper

Duarte, A., Rexachs, D., Luque, E. (2006). An Intelligent Management of Fault Tolerance in Cluster Using RADICMPI. In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, vol 4192. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846802_26

DOI: https://doi.org/10.1007/11846802_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39110-4
Online ISBN: 978-3-540-39112-8
eBook Packages: Computer Science, Computer Science (R0)