Abstract
This paper surveys techniques that developers can use to build software that tolerates design faults and faults in its surrounding environment. After reviewing the basic terms and concepts of fault tolerance, it presents the best-known fault-tolerance techniques that exploit software, information, and time redundancy, classified according to the kind of concurrency they support.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kienzle, J. (2003). Software Fault Tolerance: An Overview. In: Rosen, JP., Strohmeier, A. (eds) Reliable Software Technologies — Ada-Europe 2003. Ada-Europe 2003. Lecture Notes in Computer Science, vol 2655. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44947-7_3
DOI: https://doi.org/10.1007/3-540-44947-7_3
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40376-0
Online ISBN: 978-3-540-44947-8
eBook Packages: Springer Book Archive