It is sometimes argued that with the right training, discipline, and tools it should be possible to produce zero-defect code. Very few things in life, though, are zero-defect, not even things that can be considered life-critical. If you practice sky-diving, your main parachute could fail to open, no matter how carefully you check it before each jump. A parachutist would be wise not to trust a company that tries to sell him a zero-defect parachute. The jumper is more likely to avoid problems by bringing a spare chute. That is, the seasoned parachutist takes the possibility of component failure into account by adopting a system that has a relatively low probability of system failure. We can provide system reliability even when none of the system components is zero-defect. In many cases, though, mere redundancy does not solve the problem (e.g., multiple sky-jumpers in parallel). Reliable systems are designed with the possibility of component failure in mind, and with remedies in place to reduce the odds of system failure. Component failure is rarely an isolated event, though. In this chapter we will consider the nature of failure in complex software systems, and how we can develop methods to leverage these insights.
The research described in this chapter was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.