Measurement-Based Dependability Evaluation of Operational Computer Systems

  • Ravishankar K. Iyer
  • Dong Tang
Part of The Springer International Series in Engineering and Computer Science book series (SECS, volume 283)

Abstract

This paper discusses methodologies and advances in measurement-based dependability evaluation of operational computer systems. Research work over the past 15 years in this area is briefly reviewed. Methodologies are illustrated through discussion of the authors’ representative studies. Specifically, measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis are addressed. The discussion covers methods used in the area and several important issues previously studied, including workload/failure dependency, correlated failures, and software fault tolerance.

Keywords

Fault Diagnosis, Fault Tolerance, Software Reliability, Error Recovery, Error Group
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Kluwer Academic Publishers 1994
