Abstract
We view the design of fault-tolerant computing systems as an engineering endeavor. As such, it requires understanding both the theoretical limitations and the scope of feasible designs. We survey the impact that various environment characteristics and design choices have on the resulting system properties. We propose a single metric, system reliability, as an appropriate measure for exploring tradeoffs across a potentially large design space.
Partial support for this work was provided by the National Science Foundation under Grant DCR-86-01864 and AT&T under a Foundation Grant.
Copyright information
© 1990 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Babaoglu, Ö. (1990). The “Engineering” of fault-tolerant distributed computing systems. In: Simons, B., Spector, A. (eds) Fault-Tolerant Distributed Computing. Lecture Notes in Computer Science, vol 448. Springer, New York, NY. https://doi.org/10.1007/BFb0042341
DOI: https://doi.org/10.1007/BFb0042341
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-97385-2
Online ISBN: 978-0-387-34812-4
eBook Packages: Springer Book Archive