An evaluation of the error detection mechanisms in MARS using software-implemented fault injection

Fuchs, Emmerich

doi:10.1007/3-540-61772-8_31

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1150)

Included in the following conference series:

European Dependable Computing Conference

Abstract

The concept of fail-silent nodes greatly simplifies the design and safety proof of highly dependable fault-tolerant computer systems. The MAintainable Real-Time System (MARS) is a computer system where the hardware, operating system, and application level error detection mechanisms are designed to ensure the fail silence of nodes with a high probability.

The goal of this paper is twofold. First, the error detection capabilities of the different mechanisms are evaluated in software-implemented fault injection experiments using the well-known bit-flip fault model. The results show that a fail silence coverage of at least 85% is achievable by combining hardware and system level software error detection mechanisms. With the additional use of application level error detection mechanisms, a fail silence coverage of 100% was achieved.
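As an illustration only, the bit-flip fault model and a coverage figure can be sketched in a few lines of Python. This is not the injector used in the paper: the function names, the use of a `bytearray` as a stand-in for a node's memory image, and the definition of coverage as detected errors divided by all effective errors are assumptions made for this sketch.

```python
import random
from typing import Tuple


def flip_bit(memory: bytearray, byte_index: int, bit_index: int) -> None:
    """Invert exactly one bit of `memory` in place (hypothetical helper)."""
    if not 0 <= bit_index < 8:
        raise ValueError("bit_index must be in 0..7")
    memory[byte_index] ^= 1 << bit_index


def inject_random_bit_flip(memory: bytearray, rng: random.Random) -> Tuple[int, int]:
    """Flip one uniformly chosen bit and return its (byte, bit) location."""
    byte_index = rng.randrange(len(memory))
    bit_index = rng.randrange(8)
    flip_bit(memory, byte_index, bit_index)
    return byte_index, bit_index


def fail_silence_coverage(detected: int, violations: int) -> float:
    """Assumed definition: share of effective errors caught before the
    node could emit an incorrect result (a fail silence violation)."""
    total = detected + violations
    if total == 0:
        raise ValueError("no effective errors observed")
    return detected / total
```

Under these assumptions, an experiment run would copy a known-good memory image, call `inject_random_bit_flip` on the copy, observe whether any detection mechanism fires, and feed the resulting counts into `fail_silence_coverage`.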

Second, the limits of the application level error detection mechanisms are evaluated. In these experiments, the fault model consists of highly improbable residual faults to deliberately force the occurrence of fail silence violations. Despite this worst-case scenario, more than 50% of the presumed undetectable errors were detected by other mechanisms and hence did not lead to fail silence violations.

Author information

Authors and Affiliations

Daimler-Benz AG, Research and Technology, Berlin in cooperation with Institut für Technische Informatik, Technische Universität Wien, Austria
Emmerich Fuchs

Editor information

Andrzej Hlawiczka, João Gabriel Silva, Luca Simoncini

About this paper

Cite this paper

Fuchs, E. (1996). An evaluation of the error detection mechanisms in MARS using software-implemented fault injection. In: Hlawiczka, A., Silva, J.G., Simoncini, L. (eds) Dependable Computing — EDCC-2. EDCC 1996. Lecture Notes in Computer Science, vol 1150. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61772-8_31

DOI: https://doi.org/10.1007/3-540-61772-8_31
Published: 06 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61772-3
Online ISBN: 978-3-540-70677-9
eBook Packages: Springer Book Archive