Architecting Dependable Systems with Proactive Fault Management

Chapter in: Architecting Dependable Systems VII

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6420)

Abstract

Managing the ever-growing complexity of computing systems is a persistent challenge for computer system engineers. We argue that predictive technologies are needed to harness this complexity and to turn the vision of proactive system and failure management into reality. We describe proactive fault management, provide an overview and taxonomy of online failure prediction methods, and present a classification of the methods that failure prediction triggers. We present a model for assessing the effects of proactive fault management on system reliability and show that overall dependability can be significantly enhanced. Having shown the methods and potential of proactive fault management, we describe a blueprint for incorporating it into a dependable system’s architecture.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Salfner, F., Malek, M. (2010). Architecting Dependable Systems with Proactive Fault Management. In: Casimiro, A., de Lemos, R., Gacek, C. (eds) Architecting Dependable Systems VII. Lecture Notes in Computer Science, vol 6420. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17245-8_8

  • DOI: https://doi.org/10.1007/978-3-642-17245-8_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-17244-1

  • Online ISBN: 978-3-642-17245-8

  • eBook Packages: Computer Science, Computer Science (R0)
