The “Engineering” of fault-tolerant distributed computing systems

  • Systems Session II
  • Conference paper
Fault-Tolerant Distributed Computing

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 448))

Abstract

We view the design of fault-tolerant computing systems as an engineering endeavor. As such, it requires an understanding of both the theoretical limitations and the scope of feasible designs. We survey the impact that various environment characteristics and design choices have on the resulting system properties. We propose a single metric, system reliability, as an appropriate measure for exploring tradeoffs across a potentially large design space.
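To make the idea of reliability as a tradeoff metric concrete, the following is a minimal sketch and not the model used in the paper. It assumes that components fail independently, that each works with the same probability r, and that a system of n components survives as long as at most t of them fail. It also assumes the well-known n ≥ 3t + 1 bound for Byzantine agreement without authentication. The function names are hypothetical and chosen only for illustration.

```python
from math import comb


def system_reliability(n: int, t: int, r: float) -> float:
    """Probability that at most t of n components fail.

    Assumption (not from the paper): failures are independent and each
    component works with probability r.
    """
    return sum(comb(n, k) * (1 - r) ** k * r ** (n - k) for k in range(t + 1))


def max_byzantine_faults(n: int) -> int:
    """Largest t satisfying n >= 3t + 1 (agreement without authentication)."""
    return (n - 1) // 3


if __name__ == "__main__":
    r = 0.9  # illustrative per-component reliability
    for n in (1, 3, 4, 7):
        t = max_byzantine_faults(n)
        print(n, t, round(system_reliability(n, t, r), 4))
```

Under these assumptions, going from one component to three lowers reliability, because three components still tolerate no faults while offering more opportunities for failure. Only at four components does the tolerated fault count rise to one. This is the kind of tradeoff a single reliability figure makes visible.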

Partial support for this work was provided by the National Science Foundation under Grant DCR-86-01864 and AT&T under a Foundation Grant.

Editor information

Barbara Simons, Alfred Spector

Copyright information

© 1990 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Babaoglu, Ö. (1990). The “Engineering” of fault-tolerant distributed computing systems. In: Simons, B., Spector, A. (eds) Fault-Tolerant Distributed Computing. Lecture Notes in Computer Science, vol 448. Springer, New York, NY. https://doi.org/10.1007/BFb0042341

  • DOI: https://doi.org/10.1007/BFb0042341

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-0-387-97385-2

  • Online ISBN: 978-0-387-34812-4

  • eBook Packages: Springer Book Archive
