Design Fault Tolerance

Randell, Brian

doi:10.1007/978-3-7091-8871-2_10

Part of the book series: Dependable Computing and Fault-Tolerant Systems (DEPENDABLECOMP, volume 1)

Abstract

The aim of this paper is to provide a personal perspective on the subject of design fault tolerance, and in particular software fault tolerance, as it has developed at Newcastle and elsewhere, and to speculate briefly on how the subject might advance in the future. The principal topics covered are the search for an appropriate set of basic concepts and definitions, the differing styles of fault masking provided by recovery blocks and N-version programs, the growing sophistication of error recovery techniques, particularly in distributed systems, and the problems of assessing the cost-effectiveness of design fault tolerance.

Author information

Authors and Affiliations

Computing Laboratory, University of Newcastle upon Tyne, UK
Brian Randell

Editor information

Editors and Affiliations

UCLA, Los Angeles, Calif., USA
Algirdas Avižienis
Technical University, Wien, Austria
Hermann Kopetz
LAAS, Toulouse, France
Jean-Claude Laprie

About this paper

Cite this paper

Randell, B. (1987). Design Fault Tolerance. In: Avižienis, A., Kopetz, H., Laprie, JC. (eds) The Evolution of Fault-Tolerant Computing. Dependable Computing and Fault-Tolerant Systems, vol 1. Springer, Vienna. https://doi.org/10.1007/978-3-7091-8871-2_10

DOI: https://doi.org/10.1007/978-3-7091-8871-2_10
Publisher Name: Springer, Vienna
Print ISBN: 978-3-7091-8873-6
Online ISBN: 978-3-7091-8871-2
eBook Packages: Springer Book Archive
