Abstract
Fault tolerance schemes for modern computer clusters must be scalable, transparent, and independent of special elements. To meet these requirements we developed RADIC (Redundant Array of Distributed Independent Checkpoints), an architecture based on a fully distributed array of processes that collaborate to form a distributed fault tolerance controller. This controller works without special, central, or stable elements. RADIC performs its fault tolerance activities transparently to the user application, using a message-log rollback-recovery protocol. Based on the RADIC concepts we implemented a prototype, RADICMPI, which provides some standard MPI directives and includes all the functionality of RADIC. We tested RADICMPI in a real environment by injecting failures into cluster nodes and monitoring the behavior of the application. The tests confirmed the correct operation of RADICMPI and the effectiveness of the RADIC mechanism.
This work was supported by the MEyC-Spain under contract TIN 2004-03388.
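The message-log rollback-recovery idea named in the abstract can be illustrated with a minimal sketch. It is not the RADIC or RADICMPI implementation, whose protocol details are not given here. It assumes a deterministic process whose state is a running sum, and the class and method names (LoggedProcess, take_checkpoint, recover) are hypothetical. For brevity the checkpoint and the log live in the same object; in a real system they would have to be stored away from the node that fails.

```python
class LoggedProcess:
    """Toy deterministic process: its state is the sum of received values."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0    # last saved state (assumed to survive a failure)
        self.message_log = []  # messages received since the last checkpoint

    def receive(self, value):
        # Log the message before applying it so it can be replayed later.
        self.message_log.append(value)
        self.state += value

    def take_checkpoint(self):
        # Save the current state; older log entries are no longer needed.
        self.checkpoint = self.state
        self.message_log.clear()

    def crash(self):
        # Simulate the loss of volatile state on a node failure.
        self.state = None

    def recover(self):
        # Roll back to the checkpoint, then replay logged messages in order.
        self.state = self.checkpoint
        for value in self.message_log:
            self.state += value


p = LoggedProcess()
p.receive(1)
p.receive(2)
p.take_checkpoint()
p.receive(5)
p.crash()
p.recover()
print(p.state)
```

Because the process is deterministic, replaying the logged messages on top of the checkpoint reproduces the state it had before the failure, without forcing other processes to roll back.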
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Duarte, A., Rexachs, D., Luque, E. (2006). An Intelligent Management of Fault Tolerance in Cluster Using RADICMPI. In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, vol 4192. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846802_26
DOI: https://doi.org/10.1007/11846802_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39110-4
Online ISBN: 978-3-540-39112-8
eBook Packages: Computer Science, Computer Science (R0)