Abstract
This paper discusses methodologies and advances in measurement-based dependability evaluation of operational computer systems. Research in this area over the past 15 years is briefly reviewed. Methodologies are illustrated through discussion of the authors' representative studies. Specifically, measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis are addressed. The discussion covers the methods used in the area and several important issues studied previously, including workload/failure dependency, correlated failures, and software fault tolerance.
Copyright information
© 1994 Kluwer Academic Publishers
About this chapter
Cite this chapter
Iyer, R.K., Tang, D. (1994). Measurement-Based Dependability Evaluation of Operational Computer Systems. In: Foundations of Dependable Computing. The Springer International Series in Engineering and Computer Science, vol 283. Springer, Boston, MA. https://doi.org/10.1007/978-0-585-27377-8_7
DOI: https://doi.org/10.1007/978-0-585-27377-8_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-9484-6
Online ISBN: 978-0-585-27377-8
eBook Packages: Springer Book Archive