Abstract
The concept of fail-silent nodes greatly simplifies the design and safety proof of highly dependable fault-tolerant computer systems. The MAintainable Real-Time System (MARS) is a computer system where the hardware, operating system, and application level error detection mechanisms are designed to ensure the fail silence of nodes with a high probability.
The goal of this paper is twofold. First, the error detection capabilities of the different mechanisms are evaluated in software-implemented fault injection experiments using the well-known bit-flip fault model. The results show that a fail silence coverage of at least 85% is achievable by combining hardware and system level software error detection mechanisms. With the additional use of application level error detection mechanisms, a fail silence coverage of 100% was achieved.
Second, the limits of the application level error detection mechanisms are evaluated. In these experiments, the fault model consists of highly improbable residual faults to deliberately force the occurrence of fail silence violations. Despite this worst-case scenario, more than 50% of the presumed undetectable errors were detected by other mechanisms and hence did not lead to fail silence violations.
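To make the two central notions of the abstract concrete, the following is a minimal sketch of a bit-flip fault and of a coverage estimate. It is an illustration only and not the injector used in the MARS experiments. The function names, the outcome labels ("detected", "violation", "no_effect"), and the choice to exclude faults without observable effect from the coverage ratio are all assumptions made for this example.

```python
import random
from typing import List, Tuple


def flip_bit(memory: bytearray, byte_index: int, bit_index: int) -> None:
    """Invert one bit in place (bit-flip fault model)."""
    if not 0 <= bit_index < 8:
        raise ValueError("bit_index must be in 0..7")
    memory[byte_index] ^= 1 << bit_index


def inject_random_bit_flip(memory: bytearray, rng: random.Random) -> Tuple[int, int]:
    """Flip one uniformly chosen bit and return its (byte, bit) location."""
    byte_index = rng.randrange(len(memory))
    bit_index = rng.randrange(8)
    flip_bit(memory, byte_index, bit_index)
    return byte_index, bit_index


def fail_silence_coverage(outcomes: List[str]) -> float:
    """Estimate coverage as detected / (detected + violation).

    Assumption: runs labeled "no_effect" are ignored, because the
    injected fault never became an observable error.
    """
    detected = outcomes.count("detected")
    violations = outcomes.count("violation")
    effective = detected + violations
    if effective == 0:
        raise ValueError("no effective errors; coverage is undefined")
    return detected / effective


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed for a repeatable run
    image = bytearray(16)           # hypothetical stand-in for a memory image
    location = inject_random_bit_flip(image, rng)
    print("flipped (byte, bit):", location)

    # Hypothetical outcome labels, not data from the paper.
    runs = ["detected"] * 17 + ["violation"] * 3 + ["no_effect"] * 5
    print("estimated coverage:", fail_silence_coverage(runs))
```

Under this definition, a coverage of 100% means that every injected fault that produced an error was caught by some mechanism before the node delivered an incorrect result.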
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fuchs, E. (1996). An evaluation of the error detection mechanisms in MARS using software-implemented fault injection. In: Hlawiczka, A., Silva, J.G., Simoncini, L. (eds) Dependable Computing — EDCC-2. EDCC 1996. Lecture Notes in Computer Science, vol 1150. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61772-8_31
DOI: https://doi.org/10.1007/3-540-61772-8_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61772-3
Online ISBN: 978-3-540-70677-9