Errors and Faults

  • Ana Gainaru
  • Franck Cappello
Part of the Computer Communications and Networks book series (CCN)


Understanding the behavior of failures in large-scale systems is important in order to design techniques to tolerate them. Reliability knowledge of resources can be used in numerous ways by scientists or system administrators: (1) to improve the quality of service of the machine; (2) to reduce performance loss due to unexpected failures, through either reliability-aware scheduling or reliability-aware checkpointing; and (3) to design more resilient applications, programming models, or machines in the future. This chapter offers an overview of failures observed in real large-scale systems and their characteristics, with an emphasis on modeling, detection, and prediction.


Error Correct Code, Node Failure, Failure Type, Memory Error, Soft Error
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Ana Gainaru’s work is supported by the Blue Waters sustained-Petascale computing project, funded by the National Science Foundation (award number OCI 07-25070) and the state of Illinois. This chapter is built on material from publications co-authored with numerous colleagues. The authors would like to thank Leonardo Bautista-Gomez, Mohamed Slim Bouguerra, Jeremy Enos, Joshi Fullop, Eric Heien, Derrick Kondo, and William Kramer.



Copyright information

© Springer International Publishing Switzerland (outside the USA) 2015

Authors and Affiliations

  1. NCSA, University of Illinois at Urbana-Champaign, Champaign, USA
  2. Argonne National Laboratory, Lemont, USA
