Towards IT Systems Capable of Managing Their Health

Kadirvel, Selvi; Fortes, José A. B.

doi:10.1007/978-3-642-21292-5_5

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6662)

Included in the following conference series:

Monterey Workshop

Abstract

Self-caring systems are systems capable of monitoring and managing their own health and, indirectly, their useful lifetime. Unlike self-healing systems, which react to faults and failures, self-caring systems are aware of their health and hence can potentially circumvent and adapt to impending faults, or recover from them more quickly and effectively. Towards a methodology for modeling and incorporating health management logic and control mechanisms into an Information Technology (IT) system whose health needs to be managed, we propose the following: (1) the use of Petri nets as a discrete event system (DES) graphical model that can also be used for analysis, simulation and execution control; (2) the use of Remaining-Useful-Life (RUL) management and prognosis as a novel way of looking at health management in IT systems; and (3) the use of a control-theoretic framework for RUL management. As a simple illustration of the concept, a controller was built for useful life management in the application execution stage (containing a potential memory exhaustion fault) of an IT system.
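
To make the notion of RUL management for a memory exhaustion fault more concrete, the following Python sketch estimates the remaining useful life from a short history of memory-usage samples and derives a throttling decision from it. It is not the controller described in the chapter: the function names `estimate_rul` and `throttle_factor`, the assumption of linear memory growth, and the proportional throttling rule are all hypothetical choices made for illustration.

```python
from typing import Sequence


def estimate_rul(samples: Sequence[float], capacity: float, dt: float = 1.0) -> float:
    """Estimate time remaining until memory usage reaches `capacity`.

    Assumption (not from the chapter): usage grows roughly linearly, so a
    least-squares line through evenly spaced samples can be extrapolated.
    Returns float('inf') when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    times = [i * dt for i in range(n)]
    mean_t = sum(times) / n
    mean_y = sum(samples) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, samples)) / var_t
    if slope <= 0:
        return float("inf")
    # Time for the most recent reading to climb to capacity at the fitted rate.
    return max(0.0, (capacity - samples[-1]) / slope)


def throttle_factor(rul: float, target_rul: float) -> float:
    """Hypothetical proportional rule: fraction of workload to admit, in [0, 1].

    Admits the full workload while the estimated RUL meets the target and
    scales admission down linearly as the RUL falls below it.
    """
    if rul >= target_rul:
        return 1.0
    return max(0.0, rul / target_rul)


if __name__ == "__main__":
    usage_mb = [100.0, 110.0, 120.0, 130.0]  # illustrative readings, one per second
    rul = estimate_rul(usage_mb, capacity=200.0)
    print(f"estimated RUL: {rul:.1f} s, admit fraction: {throttle_factor(rul, 14.0):.2f}")
```

A feedback loop built on this sketch would re-estimate the RUL at each sampling interval and apply the resulting factor to the workload admitted to the application execution stage; a practical controller would likely need a less naive growth model and some smoothing of the estimate.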

Author information

Authors and Affiliations

Advanced Computing and Information Systems Lab, NSF Center for Autonomic Computing, University of Florida, Gainesville, Florida, USA
Selvi Kadirvel & José A. B. Fortes

Editor information

Editors and Affiliations

Oxford University, Wolfson Building, Parks Road, OX1 3QD, Oxford, UK
Radu Calinescu
Microsoft Research, One Microsoft Way, 98052-6399, Redmond, WA, USA
Ethan Jackson

About this paper

Cite this paper

Kadirvel, S., Fortes, J.A.B. (2011). Towards IT Systems Capable of Managing Their Health. In: Calinescu, R., Jackson, E. (eds) Foundations of Computer Software. Modeling, Development, and Verification of Adaptive Systems. Monterey Workshop 2010. Lecture Notes in Computer Science, vol 6662. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21292-5_5

DOI: https://doi.org/10.1007/978-3-642-21292-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21291-8
Online ISBN: 978-3-642-21292-5
eBook Packages: Computer Science, Computer Science (R0)