Abstract
This paper discusses the issues involved in analyzing the availability of networked systems using fault injection and the failure data collected by the logging mechanisms built into the system. In particular, we address (1) analysis in the prototype phase using physical fault injection into an actual system, illustrated by a fault injection-based evaluation of a software-implemented fault tolerance (SIFT) environment (built around a set of self-checking processes called ARMORS) that provides error detection and recovery services to spaceborne scientific applications, and (2) measurement-based analysis of systems in the field, illustrated by a LAN of Windows NT-based computers, which we use to present methods for collecting and analyzing failure data to characterize network system dependability. Both fault injection and failure data analysis enable us to study naturally occurring errors and to provide feedback to system designers on potential availability bottlenecks. For example, the study of failures in a network of Windows NT machines reveals that most of the problems that lead to reboots are software related and that, although the average availability exceeds 99%, a typical machine provides acceptable service only about 92% of the time on average.
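As a rough illustration of the kind of availability figure mentioned above, the following minimal sketch computes per-machine availability as the fraction of an observation window not spent in an outage, then averages it across machines. The `Outage` type, the `availability` function, the machine names, and all numbers are hypothetical and are not taken from the paper. The sketch also does not model the paper's notion of "acceptable service", whose definition is not given in this abstract.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One downtime interval, in hours since the start of observation (hypothetical format)."""
    start: float
    end: float


def availability(outages, observation_hours):
    """Return the fraction of the observation window during which the machine was up."""
    downtime = sum(o.end - o.start for o in outages)
    return (observation_hours - downtime) / observation_hours


# Hypothetical data: three machines, each observed for 1000 hours.
logs = {
    "m1": [Outage(100.0, 102.0)],                       # 2 hours down
    "m2": [Outage(10.0, 15.0), Outage(500.0, 505.0)],   # 10 hours down
    "m3": [],                                           # no outages
}

per_machine = {name: availability(o, 1000.0) for name, o in logs.items()}
mean_availability = sum(per_machine.values()) / len(per_machine)

for name, value in per_machine.items():
    print(f"{name}: {value:.3f}")
print(f"mean: {mean_availability:.3f}")
```

A mean computed this way can look high even when individual machines are frequently degraded, which is one reason a separate per-machine service measure can tell a different story.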
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Iyer, R.K., Kalbarczyk, Z. (2002). Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data. In: Calzarossa, M.C., Tucci, S. (eds) Performance Evaluation of Complex Systems: Techniques and Tools. Performance 2002. Lecture Notes in Computer Science, vol 2459. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45798-4_13
DOI: https://doi.org/10.1007/3-540-45798-4_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44252-3
Online ISBN: 978-3-540-45798-5
eBook Packages: Springer Book Archive