An evaluation of the error detection mechanisms in MARS using software-implemented fault injection

  • Session 2 Fault Injection
  • Conference paper
Dependable Computing — EDCC-2 (EDCC 1996)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1150)

Abstract

The concept of fail-silent nodes greatly simplifies the design and safety proof of highly dependable fault-tolerant computer systems. The MAintainable Real-Time System (MARS) is a computer system where the hardware, operating system, and application level error detection mechanisms are designed to ensure the fail silence of nodes with a high probability.

The goal of this paper is twofold. First, the error detection capabilities of the different mechanisms are evaluated in software-implemented fault injection experiments using the well-known bit-flip fault model. The results show that a fail silence coverage of at least 85% is achievable by combining hardware and system-level software error detection mechanisms. With the additional use of application-level error detection mechanisms, a fail silence coverage of 100% was achieved.

Second, the limits of the application level error detection mechanisms are evaluated. In these experiments, the fault model consists of highly improbable residual faults to deliberately force the occurrence of fail silence violations. Despite this worst-case scenario, more than 50% of the presumed undetectable errors were detected by other mechanisms and hence did not lead to fail silence violations.
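
As a rough illustration of the two notions the abstract relies on, the sketch below shows what a single bit-flip fault looks like in software and how a coverage figure could be computed from experiment outcomes. It is not the injector used in the paper. The function names, the outcome labels ("detected", "violation", "no_effect"), and the choice to exclude faults with no effect from the coverage ratio are all assumptions made for this example.

```python
import random
from typing import List, Tuple


def flip_bit(memory: bytearray, byte_index: int, bit_index: int) -> None:
    """Invert one bit of `memory` in place (a single bit-flip fault)."""
    memory[byte_index] ^= 1 << bit_index


def inject_random_bit_flip(memory: bytearray, rng: random.Random) -> Tuple[int, int]:
    """Flip one uniformly chosen bit and return its (byte, bit) location."""
    byte_index = rng.randrange(len(memory))
    bit_index = rng.randrange(8)
    flip_bit(memory, byte_index, bit_index)
    return byte_index, bit_index


def fail_silence_coverage(outcomes: List[str]) -> float:
    """Share of effective errors that were detected before a wrong output.

    Assumed (hypothetical) outcome labels:
      "detected"  - an error detection mechanism fired, the node stayed silent
      "violation" - the node produced an incorrect output (fail silence violation)
      "no_effect" - the fault never became an error; ignored in the ratio
    """
    detected = outcomes.count("detected")
    violations = outcomes.count("violation")
    effective = detected + violations
    if effective == 0:
        raise ValueError("no effective errors observed; coverage is undefined")
    return detected / effective


if __name__ == "__main__":
    image = bytearray(16)  # stand-in for a memory region of the target node
    location = inject_random_bit_flip(image, random.Random(1996))
    print("flipped (byte, bit):", location)
    print("coverage:", fail_silence_coverage(
        ["detected", "detected", "detected", "violation", "no_effect"]))
```

Under these assumptions, a coverage of 100% simply means that no experiment ended in the "violation" class.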

Editor information

Andrzej Hlawiczka João Gabriel Silva Luca Simoncini

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fuchs, E. (1996). An evaluation of the error detection mechanisms in MARS using software-implemented fault injection. In: Hlawiczka, A., Silva, J.G., Simoncini, L. (eds) Dependable Computing — EDCC-2. EDCC 1996. Lecture Notes in Computer Science, vol 1150. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61772-8_31

  • DOI: https://doi.org/10.1007/3-540-61772-8_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-61772-3

  • Online ISBN: 978-3-540-70677-9

  • eBook Packages: Springer Book Archive
