Abstract
Based on current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. Message logging, one of the most scalable checkpoint/restart techniques, is the most challenged as the number of cores per node increases, because of the high overhead of saving message payloads. Fortunately, the failures of two processes on the same node are correlated, so coordinated recovery among them comes for free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
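The payload-logging decision described above can be illustrated with a minimal sketch. It assumes that a correlated set is simply the set of processes sharing a node. The names `must_log_payload`, `logged_bytes`, and `node_of` are hypothetical and do not come from the paper. Event logging, which the protocol still performs, is not modeled here.

```python
def must_log_payload(src: int, dst: int, node_of: dict) -> bool:
    """Return True when the sender should keep a copy of the payload.

    Assumption: processes on the same node form one correlated set and
    recover together, so messages inside the set need no payload copy.
    """
    return node_of[src] != node_of[dst]


def logged_bytes(messages, node_of: dict) -> int:
    """Sum the payload bytes logged for (src, dst, payload) triples."""
    return sum(
        len(payload)
        for src, dst, payload in messages
        if must_log_payload(src, dst, node_of)
    )


# Hypothetical placement: ranks 0 and 1 share a node, rank 2 is alone.
node_of = {0: "nodeA", 1: "nodeA", 2: "nodeB"}
messages = [
    (0, 1, b"x" * 100),  # intra-node: payload not logged
    (0, 2, b"y" * 40),   # inter-node: payload logged
    (2, 1, b"z" * 10),   # inter-node: payload logged
]
total_logged = logged_bytes(messages, node_of)
```

In this toy placement only the two inter-node messages contribute to the log, which is the saving the abstract attributes to coordinating correlated processes.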
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.J. (2011). Correlated Set Coordination in Fault Tolerant Message Logging Protocols. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6853. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23397-5_6
DOI: https://doi.org/10.1007/978-3-642-23397-5_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23396-8
Online ISBN: 978-3-642-23397-5
eBook Packages: Computer Science, Computer Science (R0)