Research on Optimum Checkpoint Interval for Hybrid Fault Tolerance

Zhu, Lei; Gu, Jianhua; Wang, Yunlan; Zhao, Tianhai

doi:10.1007/978-3-642-45293-2_28

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8299)

Included in the following conference series:

International Workshop on Advanced Parallel Processing Technologies

Abstract

With the rapid growth in the size and complexity of high-performance computer systems, passive fault tolerance can no longer provide system reliability effectively because of its high overhead and poor scalability. Hybrid fault tolerance, which combines passive and active fault-tolerant approaches, has the potential to be widely used in exascale systems. However, many issues with this method remain to be resolved. This paper focuses on checkpointing in hybrid fault tolerance, where a common question is how to optimize the checkpoint interval. The paper proposes two models of systems that adopt hybrid fault tolerance and evaluates their effectiveness by comparing their results with simulation. The experimental results show that the modified model predicts the total work time well and also predicts the optimum checkpoint interval precisely.
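To make the checkpoint-interval question concrete, the sketch below computes Young's classical first-order approximation, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the mean time between failures. This is not either of the two models proposed in the paper, which are not reproduced on this page. The `effective_mtbf` helper is a hypothetical simplification added here for illustration only: it assumes that a failure predictor catches a fraction `recall` of failures and that every predicted failure is avoided by a proactive action at no cost, so only the remaining failures require a rollback.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimum checkpoint interval, sqrt(2 * delta * M).

    checkpoint_cost: time to write one checkpoint (delta), in seconds.
    mtbf: mean time between failures (M), in seconds.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def effective_mtbf(mtbf: float, recall: float) -> float:
    """Hypothetical adjustment for a hybrid scheme (an assumption, not the paper's model).

    If a fraction `recall` of failures is predicted and avoided proactively,
    only (1 - recall) of failures still force a rollback, so the mean time
    between those remaining failures grows to mtbf / (1 - recall).
    """
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    return mtbf / (1.0 - recall)


if __name__ == "__main__":
    delta = 300.0    # 5-minute checkpoint write (illustrative value)
    m = 86400.0      # 24-hour MTBF (illustrative value)

    passive = young_interval(delta, m)
    hybrid = young_interval(delta, effective_mtbf(m, recall=0.75))
    print(f"passive only: checkpoint every {passive:.0f} s")
    print(f"with 75% of failures avoided: checkpoint every {hybrid:.0f} s")
```

With these illustrative numbers the passive-only interval is 7200 s, and under the stated assumption the interval doubles to 14400 s when three quarters of failures are avoided proactively. This gives a rough sense of why a hybrid scheme calls for its own interval model rather than reusing a passive one.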

Author information

Authors and Affiliations

School of Computer, Northwestern Polytechnical University, Xi’an, China
Lei Zhu, Jianhua Gu, Yunlan Wang & Tianhai Zhao

Editor information

Editors and Affiliations

Institute of Computing Technology, State Key Laboratory of Computer Architecture, Chinese Academy of Sciences, No. 6 Kexueyuan South Road, Haidian District, 100190, Beijing, China
Chenggang Wu
Département d’Informatique, INRIA and École Normale Supérieure, 45 rue d’Ulm, 75005, Paris, France
Albert Cohen

About this paper

Cite this paper

Zhu, L., Gu, J., Wang, Y., Zhao, T. (2013). Research on Optimum Checkpoint Interval for Hybrid Fault Tolerance. In: Wu, C., Cohen, A. (eds) Advanced Parallel Processing Technologies. APPT 2013. Lecture Notes in Computer Science, vol 8299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45293-2_28

DOI: https://doi.org/10.1007/978-3-642-45293-2_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45292-5
Online ISBN: 978-3-642-45293-2
eBook Packages: Computer Science, Computer Science (R0)
