Abstract
Accurate determination of the system state is essential to ensuring system dependability. An imprecise state assessment can lead to catastrophic failure through optimistic diagnosis, or to underutilization of resources through pessimistic diagnosis. Dependability is usually achieved through a fault detection, isolation and reconfiguration (FDIR) paradigm, of which the diagnosis procedure is a primary component. Fault resolution in on-line diagnosis is key to providing an accurate system-state assessment. Most diagnostic strategies are based on limited fault models that adopt either an optimistic bias (all faults stuck-at-X, s-a-X) or a pessimistic bias (all faults Byzantine). Our Hybrid Fault-Effects Model (HFM) handles a continuum of fault types that are distinguished by their impact on system operations. While this approach has been shown to enhance system functionality and dependability, on-line FDIR is required to make the HFM practical. In this paper, we develop a methodology that uses system-state information to provide continual on-line diagnosis and reconfiguration as an integral part of system operations. We present diagnosis algorithms applicable under the generalized HFM and introduce the notion of fault decay time. Our diagnosis approach is based primarily on monitoring the system’s message traffic. Unlike existing approaches, it requires no explicit test procedures.
Work supported in part by ONR # N00014-91-C-0014
Copyright information
© 1995 Springer-Verlag/Wien
About this paper
Cite this paper
Walter, C.J., Suri, N., Hugue, M.M. (1995). Continual On-Line Diagnosis of Hybrid Faults. In: Cristian, F., Le Lann, G., Lunt, T. (eds) Dependable Computing for Critical Applications 4. Dependable Computing and Fault-Tolerant Systems, vol 9. Springer, Vienna. https://doi.org/10.1007/978-3-7091-9396-9_21
DOI: https://doi.org/10.1007/978-3-7091-9396-9_21
Publisher Name: Springer, Vienna
Print ISBN: 978-3-7091-9398-3
Online ISBN: 978-3-7091-9396-9
eBook Packages: Springer Book Archive