Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data

  • Ravishankar K. Iyer
  • Zbigniew Kalbarczyk
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2459)


This paper discusses the issues involved in analyzing the availability of networked systems using fault injection and the failure data collected by the logging mechanisms built into the system. In particular, we address two topics: (1) analysis in the prototype phase using physical fault injection into an actual system, illustrated by a fault injection-based evaluation of a software-implemented fault tolerance (SIFT) environment (built around a set of self-checking processes called ARMORs) that provides error detection and recovery services to spaceborne scientific applications; and (2) measurement-based analysis of systems in the field, illustrated by a LAN of Windows NT-based computers, for which we present methods for collecting and analyzing failure data to characterize network system dependability. Both fault injection and failure data analysis enable us to study naturally occurring errors and to provide feedback to system designers on potential availability bottlenecks. For example, the study of failures in a network of Windows NT machines reveals that most of the problems that lead to reboots are software related and that, although the average availability evaluates to over 99%, a typical machine, on average, provides acceptable service only about 92% of the time.
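The abstract quotes an availability figure without spelling out how it is computed. As a minimal sketch, assuming availability is taken as the fraction of an observation window during which a machine is not in reboot-related downtime, the calculation could look like the following. The function name `availability`, the outage-pair log format, and the sample data are all hypothetical and are not taken from the paper.

```python
from datetime import datetime, timedelta


def availability(window_start, window_end, outages):
    """Return the fraction of the observation window the machine was up.

    `outages` is a list of (down_at, up_at) datetime pairs. Each pair is
    clipped to the window; pairs are assumed not to overlap one another.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for down_at, up_at in outages:
        # Count only the part of the outage that falls inside the window.
        begin = max(down_at, window_start)
        finish = min(up_at, window_end)
        if finish > begin:
            down += (finish - begin).total_seconds()
    return (total - down) / total


# Hypothetical reboot log for one machine over 30 days (not data from the paper).
start = datetime(2000, 1, 1)
end = start + timedelta(days=30)
outages = [
    (start + timedelta(days=3), start + timedelta(days=3, hours=2)),
    (start + timedelta(days=17), start + timedelta(days=17, minutes=96)),
]
print(f"availability = {availability(start, end, outages):.4f}")
```

A measure like this counts a machine as available whenever it is not rebooting. That is one plausible reason an uptime-based figure can sit well above the share of time a machine actually delivers acceptable service, which is the gap the abstract highlights.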


Keywords (added by machine, not by the authors): Recovery Service; Armor Process; Crash Failure; Actual Execution Time; Correlate Failure



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Ravishankar K. Iyer (1)
  • Zbigniew Kalbarczyk (1)
  1. Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, Urbana
