Experiences in Fault Tolerant Computing, 1947 – 1971

  • W. C. Carter
Conference paper
Part of the Dependable Computing and Fault-Tolerant Systems book series (DEPENDABLECOMP, volume 1)

Abstract

This essay is based on my recollections of fault tolerant computing. The material included is determined by my interactions with talented and adventurous colleagues and with the general computing community. This means that many worthy and interesting projects will be slighted. I apologize to all who worked on these projects, and blame my memory. The essay begins with my work on the ENIAC (modified to be a writable-ROM, microprogram-controlled computer) and continues through my work helping to design dependable (for the period) data processing systems at Raytheon, Datamatic, and Honeywell. My report on work at IBM begins with HARVEST, includes S/360, and ends with my early work at IBM Research in fault tolerant computing, including projects started in the early 1970s, some of which were published later. After the founding of the IEEE FTTC and the Annual Fault Tolerant Computing Symposia in 1971, there is so much published material that I shall let the professional historians sort it out.

Keywords

Fault Location · Magnetic Tape · Cyclic Code · Parity Check Matrix · Core Memory
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag/Wien 1987

Authors and Affiliations

  • W. C. Carter, Woodbury, USA
