Abstract
Software-implemented approaches to fault-tolerance are resilient to technological change, since advances in hardware technology do not force an extensive re-design of specialized hardware. This paper argues the case for implementing fault-tolerance in a distributed fashion and reports the approach adopted in the European Delta-4 project. Fault-tolerance is achieved by replicating capsules (the run-time representation of application objects) on distributed nodes interconnected by a local area network. Capsule groups can be configured to tolerate either stopping failures or arbitrary failures. Multipoint protocols are used to coordinate capsule groups and to carry out error processing and fault treatment. The paper concludes with a critical analysis of the project's results.
This paper is based on an article that is to appear in IEEE Micro.
Copyright information
© 1994 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Powell, D. (1994). Distributed fault tolerance — Lessons learnt from Delta-4. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020035
DOI: https://doi.org/10.1007/BFb0020035
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57767-6
Online ISBN: 978-3-540-48330-4
eBook Packages: Springer Book Archive