A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI

Bland, Wesley; Du, Peng; Bouteiller, Aurelien; Herault, Thomas; Bosilca, George; Dongarra, Jack

doi:10.1007/978-3-642-32820-6_48

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7484)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even under extremely optimistic assumptions about advances in hardware reliability, probabilistic amplification means that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintaining future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: 1) traditional checkpoint-based approaches incur a steep overhead on failure-free operations, and 2) the dominant programming paradigm for parallel applications (the MPI standard) offers extremely limited support for software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high-quality implementation, as defined by the current MPI standard, to enable algorithm-based recovery without incurring the overhead of customary periodic checkpointing. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example.
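
The idea of checkpointing only once a failure has occurred, and then rebuilding the lost data algorithmically, can be illustrated with a toy sketch. The code below is not the authors' implementation and involves no MPI: it is a single-process simulation under stated assumptions. Each "rank" owns one block of numbers, an element-wise checksum of all blocks is assumed to survive the failure, and exactly one rank is lost. All function names (`make_blocks`, `checkpoint_on_failure`, `restart_and_recover`) are hypothetical and introduced only for illustration.

```python
# Toy, single-process simulation of the control flow; no MPI is involved.
# Assumptions (not from the paper): one lost rank, and a checksum block
# that is still available after the failure.
from typing import Dict, List


def make_blocks(n_ranks: int, block_len: int) -> List[List[float]]:
    """Hypothetical per-rank data: rank r owns block_len values."""
    return [[float(r * block_len + i) for i in range(block_len)]
            for r in range(n_ranks)]


def checksum(blocks: List[List[float]]) -> List[float]:
    """Element-wise sum over all blocks, kept as redundancy next to the data."""
    return [sum(col) for col in zip(*blocks)]


def checkpoint_on_failure(blocks: List[List[float]],
                          failed_rank: int) -> Dict[int, List[float]]:
    """Survivors save their state only after a failure has been reported.

    No checkpoint is written during failure-free execution.
    """
    return {r: list(b) for r, b in enumerate(blocks) if r != failed_rank}


def restart_and_recover(ckpt: Dict[int, List[float]],
                        chk: List[float],
                        n_ranks: int) -> List[List[float]]:
    """A fresh run reloads survivor state and rebuilds the missing block."""
    missing = [r for r in range(n_ranks) if r not in ckpt]
    assert len(missing) == 1, "this sketch tolerates exactly one lost rank"
    survivors_sum = [sum(col) for col in zip(*ckpt.values())]
    # Lost block = checksum minus the sum of the surviving blocks.
    rebuilt = [c - s for c, s in zip(chk, survivors_sum)]
    restored = dict(ckpt)
    restored[missing[0]] = rebuilt
    return [restored[r] for r in range(n_ranks)]


original = make_blocks(n_ranks=4, block_len=3)
chk = checksum(original)                               # maintained by the algorithm
ckpt = checkpoint_on_failure(original, failed_rank=2)  # triggered by the failure
recovered = restart_and_recover(ckpt, chk, n_ranks=4)
```

In this sketch the cost of saving state is paid once per failure rather than once per checkpoint interval, which is the trade-off the abstract contrasts with periodic checkpointing. A real application would also have to keep the checksum consistent as the computation updates the data, which this toy omits.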

Author information

Authors and Affiliations

Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer Blvd., Knoxville, TN, 37996-3450, USA
Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca & Jack Dongarra

Editor information

Editors and Affiliations

University of Patras, Computer Technology Institute and Press “Diophantus”, N. Kazantzaki, 26504, Rio, Greece
Christos Kaklamanis
University of Patras, University Building B, 26504, Rio, Greece
Theodore Papatheodorou
Computer Technology Institute and Press “Diophantus”, University of Patras, N. Kazantzaki, 26504, Rio, Greece
Paul G. Spirakis

About this paper

Cite this paper

Bland, W., Du, P., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. (2012). A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI. In: Kaklamanis, C., Papatheodorou, T., Spirakis, P.G. (eds) Euro-Par 2012 Parallel Processing. Euro-Par 2012. Lecture Notes in Computer Science, vol 7484. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32820-6_48

DOI: https://doi.org/10.1007/978-3-642-32820-6_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32819-0
Online ISBN: 978-3-642-32820-6
eBook Packages: Computer Science (R0)
