Abstract
Writing applications that execute efficiently on Grids is extremely difficult and tedious for inexperienced users. The distributed resources are typically heterogeneous and non-dedicated, and are offered without any performance or availability guarantees. Systems capable of adapting the execution of an application to the dynamic characteristics of the Grid are therefore essential. This paper describes the strategy used to bestow the self-healing property on autonomic EasyGrid MPI applications so that they can withstand process and resource failures. It highlights both the difficulties encountered and the low-cost solution adopted to offer fault tolerance in applications based on the standard Grid installation of LAM/MPI.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
da Silva, J.A., Rebello, V.E.F. (2007). Low Cost Self-healing in MPI Applications. In: Cappello, F., Herault, T., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2007. Lecture Notes in Computer Science, vol 4757. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75416-9_24
DOI: https://doi.org/10.1007/978-3-540-75416-9_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75415-2
Online ISBN: 978-3-540-75416-9
eBook Packages: Computer Science, Computer Science (R0)