Part of the book series: Dependable Computing and Fault-Tolerant Systems (DEPENDABLECOMP, volume 1)

Abstract

The aim of this paper is to provide a personal perspective on the subject of design fault tolerance, and in particular software fault tolerance, as it has developed at Newcastle and elsewhere, and to speculate briefly on how the subject might advance in the future. The principal topics covered are the search for an appropriate set of basic concepts and definitions, the differing styles of fault masking provided by recovery blocks and N-version programs, the growing sophistication of error recovery techniques, particularly in distributed systems, and the problems of assessing the cost/effectiveness of design fault tolerance.

Copyright information

© 1987 Springer-Verlag/Wien

About this paper

Cite this paper

Randell, B. (1987). Design Fault Tolerance. In: Avižienis, A., Kopetz, H., Laprie, JC. (eds) The Evolution of Fault-Tolerant Computing. Dependable Computing and Fault-Tolerant Systems, vol 1. Springer, Vienna. https://doi.org/10.1007/978-3-7091-8871-2_10

  • DOI: https://doi.org/10.1007/978-3-7091-8871-2_10

  • Publisher Name: Springer, Vienna

  • Print ISBN: 978-3-7091-8873-6

  • Online ISBN: 978-3-7091-8871-2

  • eBook Packages: Springer Book Archive
