Abstract
The aim of this paper is to provide a personal perspective on the subject of design fault tolerance, and in particular software fault tolerance, as it has developed at Newcastle and elsewhere, and to speculate briefly on how the subject might advance in the future. The principal topics covered are the search for an appropriate set of basic concepts and definitions, the differing styles of fault masking provided by recovery blocks and N-version programs, the growing sophistication of error recovery techniques, particularly in distributed systems, and the problems of assessing the cost/effectiveness of design fault tolerance.
Copyright information
© 1987 Springer-Verlag/Wien
About this paper
Cite this paper
Randell, B. (1987). Design Fault Tolerance. In: Avižienis, A., Kopetz, H., Laprie, JC. (eds) The Evolution of Fault-Tolerant Computing. Dependable Computing and Fault-Tolerant Systems, vol 1. Springer, Vienna. https://doi.org/10.1007/978-3-7091-8871-2_10
DOI: https://doi.org/10.1007/978-3-7091-8871-2_10
Publisher Name: Springer, Vienna
Print ISBN: 978-3-7091-8873-6
Online ISBN: 978-3-7091-8871-2