Abstract
Sequoia’s fault-tolerant computers were designed subject to some rather rigid constraints: no single hardware malfunction can generate an undetected error; an integrated circuit is a “black box” that can fail in arbitrary ways, affecting an arbitrary subset of its input and output signals; and faults can be transient or intermittent, with arbitrary durations and repetition intervals. Moreover, the additional hardware used to achieve these goals was to be kept to a minimum. The resulting computers do, to a very large extent, satisfy these constraints. To achieve this, a combination of fault-monitoring techniques was used, including bit and nibble error-correcting and error-detecting codes; byte parity codes with orthogonal partitioning; cyclic-residue codes on I/O data transfers; codes designed to protect against address counter overruns on I/O transfers; and lossless control-signal compactors. The nature of and rationale for these various fault monitors are described, as well as the analytical and testing techniques used to estimate the resulting coverage.
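As a rough illustration of one technique named above, a cyclic-residue check on a data transfer, the following minimal sketch appends a 16-bit check value to a block and verifies it on receipt. The polynomial (0x1021), the initial value (0xFFFF), and the function names `crc16`, `frame_block`, and `frame_is_intact` are assumptions chosen for the example; the abstract does not state which code or parameters the Sequoia hardware uses.

```python
def crc16(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bitwise 16-bit cyclic-residue check (parameters are illustrative)."""
    crc = init
    for byte in data:
        crc ^= byte << 8                 # fold the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:             # top bit set: shift and subtract the polynomial
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def frame_block(block: bytes) -> bytes:
    """Sender side: append the check value to the data block."""
    return block + crc16(block).to_bytes(2, "big")


def frame_is_intact(frame: bytes) -> bool:
    """Receiver side: recompute the check value and compare with the one received."""
    payload, received = frame[:-2], int.from_bytes(frame[-2:], "big")
    return crc16(payload) == received


if __name__ == "__main__":
    frame = frame_block(b"I/O transfer payload")
    print(frame_is_intact(frame))                    # undamaged frame

    damaged = bytearray(frame)
    damaged[3] ^= 0x04                               # flip a single bit in transit
    print(frame_is_intact(bytes(damaged)))           # damaged frame
```

A check of this kind flags any single-bit error in the frame, which is why it suits monitoring transfers where the failing component cannot be assumed to fail in any particular way.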
© 1998 Springer Science+Business Media New York
About this chapter
Cite this chapter
Stiffler, J.J. (1998). On-Line Fault Monitoring. In: Nicolaidis, M., Zorian, Y., Pradhan, D.K. (eds) On-Line Testing for VLSI. Frontiers in Electronic Testing, vol 11. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-6069-9_2
DOI: https://doi.org/10.1007/978-1-4757-6069-9_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-5033-8
Online ISBN: 978-1-4757-6069-9
eBook Packages: Springer Book Archive