Software Reliability and Rejuvenation: Modeling and Analysis

  • Kishor S. Trivedi
  • Kalyanaraman Vaidyanathan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2459)


Several recent studies have established that most system outages are due to software faults. Given the ever-increasing complexity of software and the well-developed techniques and analysis for hardware reliability, this trend is not likely to change in the near future. In this paper, we classify software faults and discuss various techniques to deal with them in the testing/debugging phase and the operational phase of the software. We discuss the phenomenon of software aging and software rejuvenation, a preventive maintenance technique for dealing with it. We describe stochastic models that evaluate the effectiveness of preventive maintenance in operational software systems and determine optimal times to perform rejuvenation in different scenarios. We also present measurement-based methodologies to detect software aging and estimate its effect on various system resources. These models are intended to help develop software rejuvenation policies. An automated online measurement-based approach has been used in the software rejuvenation agent implemented in a major commercial server.
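As a rough illustration of what "determining an optimal time to perform rejuvenation" can mean, the sketch below evaluates a simple time-based policy with a renewal-reward argument. It is not the model from the paper. It assumes a Weibull time to failure (shape greater than 1 standing in for aging), a fixed mean repair time after a failure, and a shorter fixed mean rejuvenation time. All function names, parameters, and numbers are hypothetical.

```python
import math


def downtime_fraction(delta, shape, scale, repair_time, rejuv_time, steps=2000):
    """Long-run fraction of time spent down when rejuvenating every `delta` hours.

    Assumption: time to failure is Weibull(shape, scale); a cycle ends either
    with a failure (followed by `repair_time`) or with a scheduled
    rejuvenation at age `delta` (followed by `rejuv_time`).
    """
    def survival(t):
        return math.exp(-((t / scale) ** shape))

    # Expected uptime per cycle = integral of the survival function over
    # [0, delta], approximated with the trapezoidal rule.
    h = delta / steps
    inner = sum(survival(i * h) for i in range(1, steps))
    uptime = h * (0.5 * survival(0.0) + inner + 0.5 * survival(delta))

    # Expected downtime per cycle, weighted by how the cycle ends.
    p_fail = 1.0 - survival(delta)
    downtime = p_fail * repair_time + (1.0 - p_fail) * rejuv_time

    return downtime / (uptime + downtime)


def best_interval(candidates, **params):
    """Return the candidate interval with the smallest downtime fraction."""
    return min(candidates, key=lambda d: downtime_fraction(d, **params))


# Hypothetical parameters, in hours.
aging = dict(shape=3.0, scale=1000.0, repair_time=10.0, rejuv_time=0.5)
grid = range(50, 3001, 50)
chosen = best_interval(grid, **aging)
print("chosen interval:", chosen)
print("downtime fraction:", downtime_fraction(chosen, **aging))
```

Under these assumptions, an aging system (shape above 1) has a finite best interval. With shape equal to 1 the failure rate is constant, so the grid search simply picks the longest interval, which matches the intuition that rejuvenation only pays off when the software actually ages.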


Keywords (added by machine, not by the authors): Multiple Input Multiple Output; Preventive Maintenance; Software Aging; Software Reliability; Software Failure



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Kishor S. Trivedi (1)
  • Kalyanaraman Vaidyanathan (1)
  1. Dept. of Electrical & Computer Engineering, Duke University, Durham, USA
