Correlated Set Coordination in Fault Tolerant Message Logging Protocols

Bouteiller, Aurelien; Herault, Thomas; Bosilca, George; Dongarra, Jack J.

doi:10.1007/978-3-642-23397-5_6


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6853)

Included in the following conference series:

European Conference on Parallel Processing


Abstract

Based on current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. Message logging, one of the most scalable checkpoint/restart techniques, is the most challenged as the number of cores per node increases, because of the high overhead of saving message payloads. Fortunately, the failure probabilities of two processes on the same node are correlated, so coordinating their recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but it eliminates the need for costly payload logging between coordinated processes.
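The split described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the `HybridLogger` class, the `node_of` placement map, and the choice of one correlated set per node are all assumptions made for illustration. Every delivery is recorded as an event, while the payload is copied only when sender and receiver belong to different correlated sets.

```python
from dataclasses import dataclass, field


@dataclass
class HybridLogger:
    """Toy logger: one correlated set per node (an assumption of this sketch)."""
    node_of: dict                                   # rank -> node id (hypothetical placement map)
    events: list = field(default_factory=list)      # delivery events, always recorded
    payloads: dict = field(default_factory=dict)    # payload copies kept for replay

    def same_set(self, src: int, dst: int) -> bool:
        # Processes on the same node are treated as one correlated set.
        return self.node_of[src] == self.node_of[dst]

    def on_deliver(self, src: int, dst: int, seq: int, payload: bytes) -> None:
        # The event is logged for every message, inside or across sets.
        self.events.append((src, dst, seq))
        # The payload is logged only across sets; inside a set, the
        # processes are assumed to restart together from a coordinated
        # checkpoint and regenerate their mutual messages.
        if not self.same_set(src, dst):
            self.payloads[(src, dst, seq)] = payload

    def logged_bytes(self) -> int:
        return sum(len(p) for p in self.payloads.values())


placement = {0: "A", 1: "A", 2: "B", 3: "B"}
log = HybridLogger(placement)
log.on_deliver(0, 1, 0, b"x" * 1024)   # intra-node: event only
log.on_deliver(0, 2, 0, b"y" * 512)    # inter-node: event and payload
log.on_deliver(3, 2, 0, b"z" * 256)    # intra-node: event only
print(len(log.events), log.logged_bytes())   # prints "3 512"
```

In this toy run, three events are recorded but only the 512-byte inter-node payload is stored, which is the saving the abstract attributes to coordinating correlated processes.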



Author information

Authors and Affiliations

Innovative Computing Laboratory, The University of Tennessee, USA
Aurelien Bouteiller, Thomas Herault, George Bosilca & Jack J. Dongarra
Oak Ridge National Laboratory, USA
Jack J. Dongarra


Editor information

Editors and Affiliations

Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Emmanuel Jeannot & Raymond Namyst
Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France
Jean Roman


About this paper

Cite this paper

Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.J. (2011). Correlated Set Coordination in Fault Tolerant Message Logging Protocols. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6853. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23397-5_6

DOI: https://doi.org/10.1007/978-3-642-23397-5_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23396-8
Online ISBN: 978-3-642-23397-5
eBook Packages: Computer Science, Computer Science (R0)
