Self-Checking and Self-Exercising Design for Hierarchic Long-Life Fault-Tolerant Systems
This research deals with fault-tolerant computers capable of operating for extended periods without external maintenance. Conventional fault-tolerance techniques such as majority voting are unsuitable for these applications, because performance is too low, power consumption is too high, and an excessive number of spares must be included to keep all of the replicated systems working over an extended life. The preferred design approach is to operate as many different computations as possible on single computers, thus maximizing the amount of processing available from limited hardware resources. Fault tolerance is implemented in a hierarchic fashion. Fault recovery is done either locally within an afflicted computer or, if that is unsuccessful, by the other working computers when one fails. Concurrent error detection is required in the computers making up these systems, since errors must be quickly detected and isolated to allow recovery to begin.
This chapter discusses ways of implementing concurrent error detection (i.e., self-checking) and, in addition, providing self-exercising capabilities that can rapidly expose dormant faults and latent errors. The fundamentals of self-checking design are presented along with an example: the design of a self-checking, self-exercising memory system. A new methodology for implementing self-checking in asynchronous subsystems is discussed, along with error simulation results that examine its effectiveness.
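As a rough illustration of the two ideas named above, the sketch below models a memory that checks every read against a stored parity bit (concurrent error detection) and can sweep all addresses to expose latent errors before normal use reaches them (self-exercising). It is not the chapter's design: the single parity bit per word, the class name CheckedMemory, and the methods inject_fault and exercise are assumptions made purely for illustration.

```python
class ParityError(Exception):
    """Raised when a word read from memory disagrees with its check bit."""


def parity(word: int) -> int:
    """Return the even-parity bit of a non-negative integer word."""
    return bin(word).count("1") & 1


class CheckedMemory:
    """Hypothetical word-addressed memory with one parity bit per word."""

    def __init__(self, size: int) -> None:
        self.data = [0] * size
        self.check = [0] * size  # parity of 0 is 0, so the initial state is consistent

    def write(self, addr: int, word: int) -> None:
        self.data[addr] = word
        self.check[addr] = parity(word)

    def read(self, addr: int) -> int:
        # Concurrent error detection: every read is checked as it happens.
        word = self.data[addr]
        if parity(word) != self.check[addr]:
            raise ParityError(addr)
        return word

    def inject_fault(self, addr: int, bit: int) -> None:
        # Test hook (assumption): flip one stored data bit without updating parity.
        self.data[addr] ^= 1 << bit

    def exercise(self) -> list[int]:
        # Self-exercising pass: read every word so latent errors surface
        # even at addresses the workload has not touched recently.
        bad = []
        for addr in range(len(self.data)):
            try:
                self.read(addr)
            except ParityError:
                bad.append(addr)
        return bad
```

A single parity bit detects only odd numbers of flipped bits per word; a real self-checking memory would likely use a stronger code and would also check the checker itself, which this sketch omits.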
Keywords: Input Pair, Rout Signal, Reset Signal, Concurrent Error Detection, Undetected Error