Design Fault Tolerance

  • Brian Randell
Part of the Dependable Computing and Fault-Tolerant Systems book series (DEPENDABLECOMP, volume 1)


The aim of this paper is to provide a personal perspective on the subject of design fault tolerance, and in particular software fault tolerance, as it has developed at Newcastle and elsewhere, and to speculate briefly on how the subject might advance in the future. The principal topics covered are the search for an appropriate set of basic concepts and definitions, the differing styles of fault masking provided by recovery blocks and N-version programs, the growing sophistication of error recovery techniques, particularly in distributed systems, and the problems of assessing the cost/effectiveness of design fault tolerance.
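The contrast between the two fault-masking styles named above can be illustrated with a minimal Python sketch. It is not taken from the paper. The helper names (`recovery_block`, `n_version`, `is_sorted` and the three sort routines) are hypothetical, the checkpoint is approximated by a deep copy, and the voter assumes hashable results. A recovery block runs alternates one at a time and accepts the first result that passes an acceptance test. An N-version program runs every version and adjudicates by majority vote.

```python
import copy
from collections import Counter


def recovery_block(state, alternates, acceptance_test):
    """Try alternates in order; return the first result that passes the test."""
    for alternate in alternates:
        candidate = copy.deepcopy(state)  # stand-in for a checkpoint / recovery cache
        try:
            result = alternate(candidate)
        except Exception:
            continue  # a raised exception is treated as a detected error
        if acceptance_test(result):
            return result
        # otherwise the copy is discarded, i.e. state is "rolled back"
    raise RuntimeError("all alternates failed the acceptance test")


def n_version(inputs, versions):
    """Run every version; return the result agreed by a strict majority."""
    results = []
    for version in versions:
        try:
            results.append(version(copy.deepcopy(inputs)))
        except Exception:
            pass  # a crashed version simply casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value
    raise RuntimeError("no majority among versions")


# Hypothetical, independently written "designs" for the same sorting task.
def buggy_sort(xs):
    return tuple(xs)  # design fault: forgets to sort


def good_sort(xs):
    return tuple(sorted(xs))


def selection_sort(xs):
    xs, out = list(xs), []
    while xs:
        smallest = min(xs)
        xs.remove(smallest)
        out.append(smallest)
    return tuple(out)


def is_sorted(result):
    """Partial acceptance test: checks ordering only, not that items are preserved."""
    return all(a <= b for a, b in zip(result, result[1:]))


data = [3, 1, 2]
rb_result = recovery_block(data, [buggy_sort, good_sort], is_sorted)
nvp_result = n_version(data, [buggy_sort, good_sort, selection_sort])
```

In this sketch the recovery block only falls back to `good_sort` after `buggy_sort` fails the acceptance test, whereas the N-version program always runs all three versions and outvotes the faulty one.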


Keywords (added by machine, not by the author): Formal Verification, Design Fault, Error Recovery, Exception Handling, Erroneous State





Copyright information

© Springer-Verlag/Wien 1987

Authors and Affiliations

  • Brian Randell, Computing Laboratory, University of Newcastle upon Tyne, UK
