The “Engineering” of fault-tolerant distributed computing systems

Babaoglu, Özalp

doi:10.1007/BFb0042341

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 448)

Abstract

We view the design of fault-tolerant computing systems as an engineering endeavor. As such, this activity requires understanding both the theoretical limitations and the scope of feasible designs. We survey the impact that various environment characteristics and design choices have on the resulting system properties. We propose a single metric, the system reliability, as an appropriate measure for exploring tradeoffs across a potentially large design space.

Partial support for this work was provided by the National Science Foundation under Grant DCR-86-01864 and AT&T under a Foundation Grant.

Author information

Özalp Babaoglu
Present address: Department of Mathematics, University of Bologna, Piazza Porta San Donato, 40127, Bologna, Italy

Authors and Affiliations

Department of Computer Science, Cornell University, 14853-7501, Ithaca, New York
Özalp Babaoglu

Editor information

Barbara Simons, Alfred Spector

About this paper

Cite this paper

Babaoglu, Ö. (1990). The “Engineering” of fault-tolerant distributed computing systems. In: Simons, B., Spector, A. (eds) Fault-Tolerant Distributed Computing. Lecture Notes in Computer Science, vol 448. Springer, New York, NY. https://doi.org/10.1007/BFb0042341

DOI: https://doi.org/10.1007/BFb0042341
Published: 08 June 2005
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-97385-2
Online ISBN: 978-0-387-34812-4
eBook Packages: Springer Book Archive
