A framework for ABFT techniques in the design of fault-tolerant computing systems
We present a framework for algorithm-based fault tolerance (ABFT) methods in the design of fault-tolerant computing systems. The ABFT error detection technique relies on the comparison of parity values computed in two ways: parity values produced by processing the inputs in parallel are compared with parity values regenerated from the original processed outputs. Numerical data processing errors are detected by comparing parity values associated with a convolution code. This article proposes a new computing paradigm to provide fault tolerance for numerical algorithms. The data processing system is protected through parity values defined by a high-rate real convolution code. Parity comparisons provide error detection, while output data correction is effected by a decoding method that accounts for both round-off error and computer-induced errors. To use ABFT methods efficiently, a systematic form is desirable. A class of burst-correcting convolution codes is investigated. The purpose is to describe new protection techniques that are easily combined with data processing methods, leading to more effective fault tolerance.
Keywords: algorithm-based fault tolerance (ABFT); burst-correcting convolution codes; parity values; syndrome
Abbreviations: ABFT, algorithm-based fault tolerance; LDPC, low density parity check; MSE, mean square error.
A class of high-rate burst-correcting convolution codes has been discussed in the literature. Convolution codes provide error detection in a continuous manner, using the same computational resources as the algorithm progresses. Redinbo presented a method to put wavelet codes into systematic form for ABFT applications. This method applies high-rate, low-redundancy wavelet codes, which use continuous checking attributes to detect the onset of errors. However, this technique is suited mainly to image processing and data compression applications. In addition, accurate measures of the detection performance of the ABFT technique using wavelet codes are difficult to obtain analytically [11, 12].
Figure 1 shows the basic architecture of an ABFT system. Existing techniques use various coding schemes to provide the information redundancy needed for error detection and correction. The coding algorithm is closely related to the running process and is often defined by real number codes, generally of the block type. Systematic codes are of most interest because the fault detection scheme can be superimposed on the original process box with the fewest changes to the algorithm and architecture. The goal is to describe new protection techniques that are easily combined with normal data processing methods, leading to more effective fault tolerance. The data processing system is protected through parity sequences specified by a high-rate real convolution code. Parity comparisons provide error detection, while output data correction is effected by a decoding method that accounts for both round-off error and computer-induced errors. The error detection structures developed here not only detect subsystem errors but also correct errors introduced in the data processing system. Concurrent parity-value techniques are very useful for detecting numerical errors in data processing operations, where a single error can propagate to many output errors.
The remainder of this article is organized as follows. Section 2 briefly discusses convolution codes. Section 3 proposes the ABFT architecture (the ABFT scheme) and the error model, and discusses the method for detecting errors using parity values. Section 4 discusses the class of burst-error-correcting convolution codes. Section 5 discusses the decoding and corrector system. Section 6 presents the evaluation and simulation results, and Section 7 presents the conclusions.
2. Convolution codes
A convolution code is an error-correcting code that processes information serially, or continuously, in short block lengths [15, 16, 17, 18, 19, 20, 21]. A convolution encoder has memory, in the sense that the output symbols depend not only on the current input symbols but also on previous inputs and/or outputs. In other words, the encoder is a sequential circuit [15, 17, 20]. A rate R = k/n convolution encoder with memory order m can be realized as a k-input, n-output linear sequential circuit with input memory order m; that is, inputs remain in the encoder for an additional m time units after entering. Typically, n and k are small integers with k < n, the information sequence is divided into blocks of length k, and the codeword is divided into blocks of length n. In the important special case k = 1, the information sequence is not divided into blocks and is processed continuously. Unlike with block codes, large minimum distances and low error probabilities are achieved not by increasing k and n but by increasing the memory order m [16, Chapter 11]. We consider only systematic forms of convolution codes because the normal operation of the Process block is not altered and no decoding is needed to obtain the true outputs. A systematic real convolution code guarantees that faults representing errors in the processed data will result in notable non-zero values in the syndrome sequence. Systematic encoding means that the information bits always appear in the first (leftmost) k positions of a code word. The remaining (rightmost) n - k bits in a code word are a function of the information bits and provide redundancy that can be used for error correction and/or detection. Real number convolution codes may find applications in channel coding for communication systems and in fault-tolerant data processing systems containing error correction.
Real-number codes can be constructed easily from finite-field codes by viewing the field elements as the corresponding integers in the real number field, and as such they theoretically have properties as good as, if not better than, those of the original finite-field structures.
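The systematic encoding described in this section can be illustrated with a minimal sketch. It assumes that each parity block is a real-valued linear combination of the current and the previous m input blocks; the function name `systematic_encode` and the tap matrices in `TAPS` are hypothetical values chosen only for illustration, not a code taken from this article.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def systematic_encode(blocks: Sequence[Sequence[float]],
                      taps: Sequence[Matrix]) -> List[List[float]]:
    """Encode k-sample input blocks with a systematic real convolution code.

    taps[i] is an (n - k) x k real matrix applied to the block that entered
    the encoder i time units earlier, so len(taps) == m + 1 (assumed model).
    """
    n_parity = len(taps[0])
    k = len(taps[0][0])
    codewords = []
    for t, block in enumerate(blocks):
        parity = [0.0] * n_parity
        for i, tap in enumerate(taps):
            if t - i < 0:
                break  # encoder memory is assumed to start in the all-zero state
            past = blocks[t - i]
            for r in range(n_parity):
                parity[r] += sum(tap[r][c] * past[c] for c in range(k))
        # Systematic form: the k data samples come first, unchanged,
        # followed by the n - k parity samples.
        codewords.append(list(block) + parity)
    return codewords


# Toy rate 2/3 code (k = 2, n = 3) with memory order m = 1; tap values are arbitrary.
TAPS = [[[1.0, 2.0]], [[0.5, -1.0]]]
data = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
encoded = systematic_encode(data, TAPS)
```

Because the data samples pass through unchanged, the processed outputs can be read directly from the first k positions of each block without decoding, which is the property the text relies on.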
3. Code usage for ABFT and ABFT scheme
3.1. Code usage for ABFT
A real convolution code in systematic form is used to compute parity values associated with the processing outputs, as shown in Figure 2. Certain classes of errors occurring anywhere in the overall system, including the parity generation and regeneration subsystems, are easily detected. A convolution code, with its encoding memory, can sense the onset of errors before they increase beyond detection limits. For a rate k/n real convolution code with constraint parameter m, it is always possible to extract the parity generating part by simple linear operations. The (n - k) parity samples for each processed block of samples are produced in block processing fashion. Since processing resources are in close proximity, it can be shown that an efficient block processing structure can produce the (n - k) parity values directly from the inputs. When these two comparable sets of parity values are subtracted, one computed from the outputs and the other directly from the inputs, only the stochastic effects remain, and the syndromes are produced as shown in Figure 2.
3.2. Modeling errors
3.3. ABFT scheme
To achieve the fault detection and correction properties of a convolution code in a linear process with minimum computational overhead, the architecture in Figure 2 is proposed. For error correction, redundancy must be inserted in some form; here convolution parity codes are employed within the ABFT framework. A systematic form of convolution code is especially advantageous in the ABFT detection plan because no additional transformations are needed to recover the processed data after the detection operations. Figure 2 summarizes an ABFT technique employing a systematic convolution code to define the parity values. The data processing operations are combined with the parity generating function to provide one set of parity values. Here k is the basic block size of the input data and n is the block size of the output data; for every k new data samples accepted, (n - k) new parity values are produced.
The upper path in Figure 2 is the processed data flow, which passes through the process block (data processing block) and then feeds the convolution encoder (parity regeneration) to produce parity values. In the other path, comparable parity values are generated efficiently and directly from the inputs (parity and processing combined, see Figure 2), without producing the original outputs. The difference between the two comparable sets of parity values, which are computed in different ways, is called the syndrome; in the absence of errors, the syndrome sequence is a stream of zero or near-zero values. The convolution code's structure is designed to produce distinct syndromes for a large class of errors appearing in the processing outputs. In this way, the scheme in Figure 2 employs convolution code parity to detect and correct processing errors.
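The two parity paths and the resulting syndrome can be sketched as follows. This is a simplified illustration under stated assumptions: the process is a linear map `A`, the parity is a single memoryless matrix `P` (the encoder memory of a convolution code is omitted for brevity), and `A`, `P`, and `THRESHOLD` are hypothetical values rather than quantities from this article.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Matrix-matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def syndrome(x: Sequence[float], y: Sequence[float],
             process: Matrix, parity: Matrix) -> List[float]:
    """Difference between parity regenerated from outputs and parity from inputs."""
    combined = matmul(parity, process)     # parity and processing combined
    from_inputs = matvec(combined, x)      # computed without forming the outputs
    from_outputs = matvec(parity, y)       # parity regeneration from the outputs
    return [o - i for o, i in zip(from_outputs, from_inputs)]


A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]  # hypothetical linear process
P = [[1.0, 1.0, 1.0]]                      # hypothetical single parity row
THRESHOLD = 1e-9                           # assumed round-off tolerance

x = [2.0, 1.0]
y_good = matvec(A, x)
y_bad = list(y_good)
y_bad[1] += 0.5                            # inject an error into one output sample

s_good = syndrome(x, y_good, A, P)
s_bad = syndrome(x, y_bad, A, P)
error_detected = any(abs(s) > THRESHOLD for s in s_bad)
```

In a real system the combined matrix would be computed once in advance, so the second path costs only (n - k) parity evaluations per block instead of repeating the full processing.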
3.4. Error detection
4. Burst-error-correcting convolution codes
A burst of length d is defined as a vector whose non-zero components are confined to d consecutive digit positions, the first and last of which are non-zero [16, 17]. A burst refers to a group of possibly contiguous errors, which is characteristic of the unforeseeable effects of errors in data computation. Only systematic forms of convolution codes are considered here because the normal operation of the Process block is unchanged and no decoding is needed to obtain the true outputs. Moreover, convolution codes have good correcting characteristics because of the memory in their encoding structure.
4.1. Bounds on burst-error-correcting convolution codes
Following Costello and Lin, a sequence of error bits e_{d+1}, e_{d+2}, ..., e_{d+a} is called a burst of length a relative to a guard space of length b if
the b bits preceding e_{d+1} and the b bits following e_{d+a} are all zero;
the a bits from e_{d+1} through e_{d+a} contain no subsequence of b consecutive zeros.
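The two conditions above can be checked mechanically. The sketch below is a literal transcription of the definition; the function name `is_burst`, the 0-based indexing (so `start` is the index of e_{d+1}), and the treatment of positions outside the sequence as zero are assumptions made for this illustration.

```python
from typing import Sequence


def is_burst(errors: Sequence[int], start: int, a: int, b: int) -> bool:
    """Check whether errors[start:start + a] is a burst of length a
    relative to a guard space of length b."""
    if a < 1 or b < 1:
        raise ValueError("a and b must be positive")

    def bit(i: int) -> int:
        # Positions outside the sequence are assumed to be error-free.
        return errors[i] if 0 <= i < len(errors) else 0

    # Condition 1: the b bits before and the b bits after the burst are zero.
    guard_before = all(bit(i) == 0 for i in range(start - b, start))
    guard_after = all(bit(i) == 0 for i in range(start + a, start + a + b))
    if not (guard_before and guard_after):
        return False

    # Condition 2: the burst itself contains no run of b consecutive zeros.
    run = 0
    for i in range(start, start + a):
        run = run + 1 if bit(i) == 0 else 0
        if run >= b:
            return False
    return True


pattern = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
burst_with_guard_3 = is_burst(pattern, start=3, a=4, b=3)
burst_with_guard_1 = is_burst(pattern, start=3, a=4, b=1)
```

With a guard space of 3 the pattern qualifies as a single burst of length 4, whereas with a guard space of 1 the isolated zero inside the span already violates the second condition.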
Here I_{n-k} and 0_{n-k} are the (n - k) × (n - k) identity and all-zero matrices, respectively, and P_i is an (n - k) × k matrix. In an alternate view, the respective rows of H_0 contain the parity submatrices P_i.
5. Decoding and corrector system
6. Simulations and results
6.1. Design evaluation
By comparing the two parity values, we can show how the checking system responds to errors.
6.2. Mean square error performance
6.3. Examples and simulations
A BP burst-correcting convolution code (6, 5, 11) is constructed for use in a fault-tolerant processing situation. A rate 1/3 (3, 1, 10) code, which has a constraint parameter m = 10, is chosen from a standard text. Long simulations involving 250,000 blocks of data over a wide range of variances are performed. For the rate 1/3 code, this represents 750,000 samples, while for the rate 5/6 code it implies 1.5 million samples. Both bursts and isolated errors within a block are permitted. A burst in this context means that the standard deviations of all components in a block are raised to 10% of the maximum standard deviation. When a burst is not active, errors are allowed at positions within a block chosen independently at random, and the selected positions have their standard deviations raised to 10% of their maxima. The probability of a burst is 5 × 10^-3, while intra-block errors have probability 10^-3. For the long simulations, the basic parameter σ^2 (the variance of the error) is varied from 10^-9 up to 3.2.
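The error injection model described above can be sketched as follows. This is one possible reading, not the authors' MATLAB code: it assumes that the intra-block error probability applies independently to each position, and the names `block_std_devs`, `sigma`, and `sigma_max` (a base standard deviation and a hypothetical maximum) are introduced only for illustration.

```python
import random
from typing import List

P_BURST = 5e-3    # probability that a whole block is hit by a burst (from the text)
P_ERROR = 1e-3    # assumed per-position error probability when no burst is active
BLOCKS = 250_000  # number of simulated blocks (from the text)


def block_std_devs(n: int, sigma: float, sigma_max: float, rng: random.Random,
                   p_burst: float = P_BURST, p_error: float = P_ERROR) -> List[float]:
    """Return the standard deviation used for each of the n samples in one block."""
    raised = 0.1 * sigma_max  # "raised to 10% of the maximum standard deviation"
    if rng.random() < p_burst:
        return [raised] * n   # burst: every component in the block is affected
    # No burst: each position is selected independently at random.
    return [raised if rng.random() < p_error else sigma for _ in range(n)]


# Sample counts for block lengths n = 3 (rate 1/3) and n = 6 (rate 5/6).
samples_per_run = {n: BLOCKS * n for n in (3, 6)}

# Draw the error samples for a single block of the rate 5/6 code.
rng = random.Random(0)
noisy_block = [rng.gauss(0.0, s) for s in block_std_devs(6, 1e-4, 1.0, rng)]
```

The dictionary `samples_per_run` reproduces the sample counts quoted in the text: 250,000 blocks of length 3 give 750,000 samples, and 250,000 blocks of length 6 give 1.5 million.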
7. Conclusions
This article addresses new methods for performing error correction when real convolution codes are involved. Real convolution codes can provide effective protection for data processing operations at the data-parity level. Data processing implementations are protected against both hard and soft errors. The data processing system is protected through parity sequences specified by a high-rate real convolution code. Parity comparisons provide error detection, while output data correction is effected by a decoding method that accounts for both round-off error and computer-induced errors. The error detection structures developed here not only detect subsystem errors but also correct errors introduced in the data processing system. Concurrent parity-value techniques are very useful for detecting numerical errors in data processing operations, where a single error can propagate to many output errors. Parity values are an effective tool for detecting burst errors occurring in the code stream. The detection performance of the data processing system depends on the detection threshold, which is determined by round-off tolerances. The structures have been tested using MATLAB programs that compute the error-detecting performance of the concurrent parity values method, and the simulation results are presented.
The authors are grateful for the comments from Mrs. Mahbobeh Meshkinfam, which significantly improved the quality of this article.
- 5. Rexford J, Jha NK: Algorithm-based fault tolerance for floating-point operations in massively parallel systems. Proc Int Symp on Circuits & Systems 1992, 649-652.
- 15. Veeravalli VS: Fault tolerance for arithmetic and logic unit. IEEE Southeastcon 09 2009, 329-334.
- 16. Costello D, Lin S: Error Control Coding Fundamentals and Applications. 2nd edition. Pearson Education Inc., NJ; 2004.
- 18. Viterbi AJ, Omura JK: Principles of Digital Communication and Coding. 2nd edition. McGraw-Hill; 1985.
- 21. Lee LHC: Convolutional Coding: Fundamentals and Applications. Artech House; 1997.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.