Fault-tolerance by replication in distributed systems

  • Invited Papers
  • Conference paper
Reliable Software Technologies — Ada-Europe '96 (Ada-Europe 1996)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1088))

Abstract

The paper is a tutorial on fault-tolerance by replication in distributed systems. We start by defining linearizability as the correctness criterion for replicated services (or objects), and present the two main classes of replication techniques: primary-backup replication and active replication. We introduce group communication as the infrastructure providing the adequate multicast primitives to implement either primary-backup replication, or active replication. Finally, we discuss the implementation of the two most fundamental group multicast primitives: total order multicast and view synchronous multicast.
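
As a rough illustration of the active replication idea named in the abstract, the following is a minimal in-process sketch. It assumes a fixed sequencer as the total order multicast mechanism, which is only one common approach and is not taken from the paper itself; the names `Replica`, `Sequencer`, `stamp`, `deliver`, and `run_demo` are hypothetical. Each replica applies operations strictly in sequence-number order, so all replicas reach the same state even when messages arrive in different network orders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Op = Tuple[str, str, int]  # (kind, key, value); kind is "set" or "add"


@dataclass
class Replica:
    """Hypothetical deterministic replica of a small key-value service."""
    name: str
    state: Dict[str, int] = field(default_factory=dict)
    next_seq: int = 0                      # next sequence number to apply
    pending: Dict[int, Op] = field(default_factory=dict)
    log: List[Op] = field(default_factory=list)

    def deliver(self, seq: int, op: Op) -> None:
        # Buffer the message, then apply every operation that is now
        # contiguous with what has already been applied.
        self.pending[seq] = op
        while self.next_seq in self.pending:
            self._apply(self.pending.pop(self.next_seq))
            self.next_seq += 1

    def _apply(self, op: Op) -> None:
        kind, key, value = op
        if kind == "set":
            self.state[key] = value
        elif kind == "add":
            self.state[key] = self.state.get(key, 0) + value
        self.log.append(op)


class Sequencer:
    """Assumed fixed sequencer: assigns one global order to all operations."""

    def __init__(self) -> None:
        self.counter = 0

    def stamp(self, op: Op) -> Tuple[int, Op]:
        seq = self.counter
        self.counter += 1
        return seq, op


def run_demo() -> List[Replica]:
    replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
    sequencer = Sequencer()
    ops: List[Op] = [("set", "x", 1), ("add", "x", 5),
                     ("set", "x", 10), ("add", "x", 2)]
    stamped = [sequencer.stamp(op) for op in ops]

    # Simulate a different network arrival order at each replica.
    arrivals = [stamped, list(reversed(stamped)), stamped[2:] + stamped[:2]]
    for replica, messages in zip(replicas, arrivals):
        for seq, op in messages:
            replica.deliver(seq, op)
    return replicas


if __name__ == "__main__":
    for r in run_demo():
        print(r.name, r.state)
```

The sketch ignores failures entirely: a crashed sequencer or replica, which is where view synchronous multicast and membership changes would come in, is outside what it models.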

Editor information

Alfred Strohmeier

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guerraoui, R., Schiper, A. (1996). Fault-tolerance by replication in distributed systems. In: Strohmeier, A. (ed.) Reliable Software Technologies — Ada-Europe '96. Ada-Europe 1996. Lecture Notes in Computer Science, vol 1088. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0013477

  • DOI: https://doi.org/10.1007/BFb0013477

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-61317-6

  • Online ISBN: 978-3-540-68457-2

  • eBook Packages: Springer Book Archive
