Distributed fault tolerance — Lessons learnt from Delta-4

  • Software Architectures for Fault Tolerance
  • Conference paper
Hardware and Software Architectures for Fault Tolerance (Fault Tolerance 1993)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 774))

Abstract

Software-implemented approaches to fault-tolerance adapt well to change, since advances in hardware technology do not force an extensive re-design of specialized hardware. This paper argues the case for implementing fault-tolerance in a distributed fashion and reports the approach adopted in the European Delta-4 project. Fault-tolerance is achieved by replicating capsules (the run-time representation of application objects) on distributed nodes interconnected by a local area network. Capsule groups can be configured to tolerate either stopping failures or arbitrary failures. Multipoint protocols are used for coordinating capsule groups and for error processing and fault treatment. The paper concludes with a critical analysis of the project's results.

This paper is based on an article that is to appear in IEEE Micro.
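
The distinction the abstract draws between stopping failures and arbitrary failures can be illustrated with a minimal sketch. It is a generic illustration of the two failure assumptions, not the Delta-4 protocols themselves: the function names `first_response` and `majority_vote` are hypothetical, and a silent replica is modelled as `None`.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def first_response(replies: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Stopping-failure assumption (illustrative): a replica either replies
    correctly or stays silent (None), so any reply received can be trusted."""
    for reply in replies:
        if reply is not None:
            return reply
    return None  # every replica was silent


def majority_vote(replies: Sequence[Optional[Hashable]],
                  group_size: int) -> Optional[Hashable]:
    """Arbitrary-failure assumption (illustrative): a replica may reply with a
    wrong value, so a value is accepted only when a strict majority of the
    whole group reported it."""
    counts = Counter(reply for reply in replies if reply is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes > group_size // 2 else None
```

Under these assumptions, a group of two replicas is enough for `first_response` to mask one silent replica, whereas `majority_vote` needs a group of three to mask one replica that returns a wrong value.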

Editor information

Michel Banâtre, Peter A. Lee

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Powell, D. (1994). Distributed fault tolerance — Lessons learnt from Delta-4. In: Banâtre, M., Lee, P.A. (eds) Hardware and Software Architectures for Fault Tolerance. Fault Tolerance 1993. Lecture Notes in Computer Science, vol 774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0020035

  • DOI: https://doi.org/10.1007/BFb0020035

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-57767-6

  • Online ISBN: 978-3-540-48330-4

  • eBook Packages: Springer Book Archive
