Errors and Faults

Gainaru, Ana; Cappello, Franck

doi:10.1007/978-3-319-20943-2_2

Part of the book series: Computer Communications and Networks (CCN)

Abstract

Understanding how failures behave in large-scale systems is essential to designing techniques that tolerate them. Scientists or system administrators can use reliability knowledge about resources in several ways: (1) to improve the quality of service of the machine; (2) to reduce performance loss due to unexpected failures, through either reliability-aware scheduling or reliability-aware checkpointing; and (3) to design more resilient applications, programming models, or machines in the future. This chapter offers an overview of the failures observed in real large-scale systems and their characteristics, with an emphasis on modeling, detection, and prediction.

Acknowledgments

Ana Gainaru’s work is supported by the Blue Waters sustained-petascale computing project, funded by the National Science Foundation (award number OCI 07-25070) and the state of Illinois. This chapter builds on material from publications co-authored with numerous colleagues. The authors would like to thank Leonardo Bautista-Gomez, Mohamed Slim Bouguerra, Jeremy Enos, Joshi Fullop, Eric Heien, Derrick Kondo, and William Kramer.

Author information

Authors and Affiliations

NCSA, University of Illinois at Urbana-Champaign, Champaign, USA
Ana Gainaru
Argonne National Laboratory, Lemont, USA
Franck Cappello

Corresponding author

Correspondence to Ana Gainaru.

Editor information

Editors and Affiliations

University of Tennessee, Knoxville, Tennessee, USA
Thomas Herault
Ecole Normale Supérieure de Lyon, Lyon, France
Yves Robert

About this chapter

Cite this chapter

Gainaru, A., Cappello, F. (2015). Errors and Faults. In: Herault, T., Robert, Y. (eds) Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-20943-2_2

DOI: https://doi.org/10.1007/978-3-319-20943-2_2
Published: 02 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20942-5
Online ISBN: 978-3-319-20943-2
eBook Packages: Computer Science, Computer Science (R0)
