Architecting Dependable Systems with Proactive Fault Management

Salfner, Felix; Malek, Miroslaw

doi:10.1007/978-3-642-17245-8_8

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6420)

Abstract

Managing the ever-growing complexity of computing systems is a perennial challenge for computer system engineers. We argue that predictive technologies are needed to harness this complexity and to turn the vision of proactive system and failure management into reality. We describe proactive fault management, provide an overview and taxonomy of online failure prediction methods, and present a classification of failure prediction-triggered methods. We present a model for assessing the effects of proactive fault management on system reliability and show that overall dependability can be significantly enhanced. Having shown the methods and potential of proactive fault management, we describe a blueprint for incorporating it into a dependable system’s architecture.

Author information

Authors and Affiliations

Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
Felix Salfner & Miroslaw Malek

Editor information

Editors and Affiliations

Faculty of Science, University of Lisbon, Campo Grande, Bloco C6, Piso 3, 1749-016, Lisbon, Portugal
Antonio Casimiro
School of Computing, University of Kent, CT2 7NF, Canterbury, Kent, UK
Rogério de Lemos
Centre for Software Reliability, City University, London, Northampton Square, EC1V 0HB, London, UK
Cristina Gacek

About this chapter

Cite this chapter

Salfner, F., Malek, M. (2010). Architecting Dependable Systems with Proactive Fault Management. In: Casimiro, A., de Lemos, R., Gacek, C. (eds) Architecting Dependable Systems VII. Lecture Notes in Computer Science, vol 6420. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17245-8_8

DOI: https://doi.org/10.1007/978-3-642-17245-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17244-1
Online ISBN: 978-3-642-17245-8
eBook Packages: Computer Science, Computer Science (R0)
