The Evolution of Fault Tolerant Computing at the Charles Stark Draper Laboratory, 1955–85

  • Albert L. Hopkins, Jr.
  • Jaynarayan H. Lala
  • T. Basil Smith, III
Part of the Dependable Computing and Fault-Tolerant Systems book series (DEPENDABLECOMP, volume 1)

Abstract

Fault-tolerant computing became an issue of importance at the Draper Laboratory at the same time that digital computers began to be incorporated into guidance, navigation, and control systems. Early systems emphasized fault avoidance, with satisfactory results. More complex systems, which followed, incorporated redundancy.

Early redundancy architecture was constrained by size, weight, and cost penalties, and tended toward standby dual forms. As integrated circuits grew in complexity, more massive forms of redundancy evolved in Draper’s architectures.
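
The trade-off between dual and more massive redundancy can be made concrete with textbook reliability formulas. The sketch below is illustrative and not taken from the chapter. It assumes independent unit failures, a unit reliability r over the mission, a perfect voter for the triple, and a simple coverage model (the assumed parameter `coverage`) in which a hot-standby dual survives a single failure only if that failure is detected and switched out.

```python
def simplex(r: float) -> float:
    """Reliability of a single, non-redundant unit."""
    return r


def standby_dual(r: float, coverage: float = 1.0) -> float:
    """Hot-standby dual under a simple coverage model (an assumption).

    Both units are powered; the system survives one unit failure only if
    that failure is detected and switched out, with probability `coverage`.
    """
    both_good = r * r
    exactly_one_good = 2.0 * r * (1.0 - r)
    return both_good + coverage * exactly_one_good


def tmr(r: float) -> float:
    """Triple modular redundancy with a perfect majority voter.

    Works if at least two of three units work:
    r**3 + 3 * r**2 * (1 - r) = 3r^2 - 2r^3.
    """
    return 3.0 * r**2 - 2.0 * r**3


if __name__ == "__main__":
    for r in (0.9, 0.99, 0.999):
        print(f"R={r}: simplex={simplex(r):.6f} "
              f"dual={standby_dual(r):.6f} tmr={tmr(r):.6f}")
```

With r = 0.9 and perfect coverage the dual scores 0.99 against 0.972 for the triple, but the dual depends on detecting the fault: in this model its advantage disappears once coverage falls to 0.9 (0.81 + 0.9 × 0.18 = 0.972), whereas the triple masks a single failure by voting alone.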

The challenge of full-time, full-authority control of commercial aircraft motivated a number of research activities aimed at extremely low system failure rates. These activities revealed substantial problems in the practical realization of redundant systems, even though such systems appear extremely simple in the abstract. One example is the synchronization of redundant clocks, where a fundamental rule was discovered that later emerged, in a more general form, as the “Byzantine Generals Problem”. A hybrid-redundant multiprocessor with reconfigurable triads (FTMP) resulted from this research.
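
The idea of hybrid redundancy with reconfigurable triads (voting to mask an error, then retiring the faulty module and replacing it from a spare pool) can be sketched as a toy model. Everything here is a hypothetical illustration: the class name `HybridTriad`, the module names, the exact-match vote, and the replacement policy are assumptions, not a description of how FTMP was actually built.

```python
from collections import Counter
from typing import Dict, List, Optional


class HybridTriad:
    """Toy triad of redundant processors backed by a pool of spares.

    Hypothetical illustration: names, the exact-match vote, and the
    "retire the dissenter, swap in a spare" policy are assumptions,
    not FTMP's actual design.
    """

    def __init__(self, active: List[str], spares: List[str]) -> None:
        if len(active) != 3:
            raise ValueError("a triad needs exactly three active modules")
        self.active = list(active)
        self.spares = list(spares)
        self.retired: List[str] = []

    def vote(self, outputs: Dict[str, int]) -> Optional[int]:
        """Return the majority output and reconfigure around any dissenter.

        `outputs` maps each active module's name to the word it produced.
        Returns None when no two active modules agree: the error is
        detected but cannot be masked.
        """
        words = [outputs[m] for m in self.active]
        value, count = Counter(words).most_common(1)[0]
        if count < 2:
            return None
        for module in list(self.active):  # copy: _replace mutates self.active
            if outputs[module] != value:
                self._replace(module)
        return value

    def _replace(self, failed: str) -> None:
        """Retire a failed module and, if a spare exists, put it in its slot."""
        slot = self.active.index(failed)
        self.retired.append(failed)
        if self.spares:
            self.active[slot] = self.spares.pop(0)
        else:
            # No spares left: degrade to a duplex, which detects but cannot mask
            del self.active[slot]


if __name__ == "__main__":
    triad = HybridTriad(["P1", "P2", "P3"], spares=["P4"])
    print(triad.vote({"P1": 7, "P2": 7, "P3": 9}), triad.active, triad.retired)
```

In the usage lines, P3's wrong output is outvoted, P3 is retired, and spare P4 takes its slot, so the triad keeps masking single errors until the spare pool is exhausted.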

Recent research has capitalized on large-scale integrated circuits, as well as on past fault-tolerant system architectures, to yield a modular n-redundant, tightly synchronized computer that is virtually transparent to software and is thus able to capture software written for simplex systems, including certain n-version software forms. Computers of this type are being deployed in numerous applications.
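
For a modular n-redundant computer, a minimal sketch of an exact-match majority voter over n channel outputs is shown below, together with two fault-count bounds. The function names are hypothetical. `masked_faults` assumes faulty channels present the same wrong value to a single fault-free voter. `byzantine_faults` uses the widely cited result, associated with the Byzantine Generals Problem mentioned above, that agreement despite f arbitrarily faulty units requires n ≥ 3f + 1.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def nmr_vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Exact-match majority vote over n redundant channel outputs.

    Returns (voted_value, dissenting_channel_indices). If no value is held
    by a strict majority of channels, returns (None, []): the fault is
    detected but cannot be masked.
    """
    n = len(outputs)
    if n == 0:
        raise ValueError("need at least one channel")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= n:
        return None, []
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


def masked_faults(n: int) -> int:
    """Faulty channels a strict-majority voter can outvote, assuming each
    faulty channel shows the same wrong value to the voter (non-Byzantine)."""
    return (n - 1) // 2


def byzantine_faults(n: int) -> int:
    """Arbitrarily (Byzantine) faulty channels tolerable, from n >= 3f + 1."""
    return (n - 1) // 3
```

The gap between the two bounds is the practical point: three channels outvote one ordinary fault, but tolerating one arbitrary fault, for example one that sends different values to different observers, takes four.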

Copyright information

© Springer-Verlag/Wien 1987

Authors and Affiliations

  • Albert L. Hopkins, Jr., ITP Boston, Inc., Cambridge, USA
  • Jaynarayan H. Lala, C.S. Draper Laboratory, Inc., Cambridge, USA
  • T. Basil Smith, III, Hamilton Standard Corp., Carrollton, USA

Personalised recommendations