Abstract
As computational clusters grow in size, their mean time to failure drops sharply. Checkpointing is generally used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for the checkpoints. This creates a bottleneck that severely limits the scalability of checkpointing, while dedicated checkpointing networks and storage systems prove too expensive. We propose a Stair-Case Replication (SCR) based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, the approach is directly applicable to any MPI implementation. We use the staircase method of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system, and show that they do not scale, particularly beyond 64 CPUs. The staircase MPI method allows access from a lower complexity level to a higher complexity level, which improves efficiency over the previous method.
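The two scaling claims in the abstract can be illustrated with a back-of-the-envelope model. The sketch below is not the SCR method itself and does not come from the paper: it assumes independent node failures with exponentially distributed lifetimes (so the cluster's mean time to failure is the node value divided by the node count), and it compares a single shared storage link against per-process peer replication while ignoring network contention. All function names, parameters, and numbers are hypothetical.

```python
def system_mttf_hours(node_mttf_hours: float, num_nodes: int) -> float:
    """MTTF of a cluster that fails as soon as any one node fails.

    Assumption: nodes fail independently with exponential lifetimes,
    so failure rates add and the system MTTF is node MTTF / N.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mttf_hours / num_nodes


def central_checkpoint_seconds(num_procs: int, image_mb: float,
                               storage_mb_per_s: float) -> float:
    """Time for all processes to write their images through one shared link.

    Assumption: the shared storage link is the only bottleneck.
    """
    return num_procs * image_mb / storage_mb_per_s


def peer_checkpoint_seconds(image_mb: float, link_mb_per_s: float,
                            replicas: int = 1) -> float:
    """Time for each process to send its image to `replicas` peers.

    Assumption: every process has its own link and there is no contention,
    so the time does not depend on the number of processes.
    """
    return replicas * image_mb / link_mb_per_s


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trend.
    for n in (16, 64, 256):
        print(n,
              system_mttf_hours(10_000.0, n),
              central_checkpoint_seconds(n, 512.0, 1024.0),
              peer_checkpoint_seconds(512.0, 128.0, replicas=2))
```

Under these assumptions the centralized write time grows linearly with the number of processes while the replicated time stays constant, which is the kind of bottleneck the abstract describes.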
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bansal, S., Sharma, S., Trivedi, I. (2011). A Novel Stair-Case Replication (SCR) Based Fault Tolerance for MPI Applications. In: Das, V.V., Thomas, G., Lumban Gaol, F. (eds) Information Technology and Mobile Communication. AIM 2011. Communications in Computer and Information Science, vol 147. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20573-6_80
DOI: https://doi.org/10.1007/978-3-642-20573-6_80
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20572-9
Online ISBN: 978-3-642-20573-6
eBook Packages: Computer Science (R0)