# A framework for ABFT techniques in the design of fault-tolerant computing systems

## Abstract

We present a framework for algorithm-based fault tolerance (ABFT) methods in the design of fault-tolerant computing systems. The ABFT error detection technique relies on the comparison of parity values computed in two ways. Processing the input parity values in parallel produces output parity values that are comparable with parity values regenerated from the original processed outputs. Numerical data processing errors are detected by comparing parity values associated with a convolution code. This article proposes a new computing paradigm to provide fault tolerance for numerical algorithms. The data processing system is protected through parity values defined by a high-rate real convolution code. Parity comparisons provide error detection, while output data correction is effected by a decoding method that accounts for both round-off error and computer-induced errors. To use ABFT methods efficiently, a systematic form is desirable. A class of burst-correcting convolution codes is investigated. The purpose is to describe new protection techniques that are easily combined with data processing methods, leading to more effective fault tolerance.

## Keywords

algorithm-based fault tolerance (ABFT); burst-correcting convolution codes; parity values; syndrome

## Abbreviations

- ABFT: algorithm-based fault tolerance
- BP: Berlekamp-Preparata
- LDPC: low density parity check
- MSE: mean square error
- RS: Reed-Solomon

## 1. Introduction

A class of high-rate burst-correcting convolution codes is discussed in [10]. Convolution codes provide error detection in a continuous manner, using the same computational resources as the algorithm progresses. Redinbo [11] presented a method for placing wavelet codes into systematic forms for ABFT applications. This method applies high-rate, low-redundancy wavelet codes whose continuous checking attributes detect the onset of errors. However, this technique is suited to image processing and data compression applications. In addition, it is analytically difficult to obtain accurate measures of the detection performance of the ABFT technique using wavelet codes [11, 12].

Figure 1 [13] shows the basic architecture of an ABFT system. Existing techniques use various coding schemes to provide the information redundancy needed for error detection and correction. The coding algorithm is closely related to the running process and is often defined by real number codes, generally of the block type [14]. Systematic codes are of most interest because the fault detection scheme can be superimposed on the original process box with the fewest changes to the algorithm and architecture. The goal is to describe new protection techniques that are easily combined with normal data processing methods, leading to more effective fault tolerance. The data processing system is protected through parity sequences specified by a high-rate real convolution code. Parity comparisons provide error detection, while output data correction is effected by a decoding method that accounts for both round-off error and computer-induced errors. The error detection structures developed here not only detect subsystem errors but also correct errors introduced in the data processing system. Concurrent parity value techniques are very useful in detecting numerical errors in data processing operations, where a single error can propagate to many output errors.

The remainder of this article is organized as follows. Section 2 briefly discusses convolution codes. Section 3 proposes the ABFT architecture (ABFT scheme) and the error model, and discusses the method for detecting errors using parity values. Section 4 discusses one class of convolution codes, the burst-error-correcting convolution codes. Section 5 discusses the decoding and corrector system. Section 6 presents the simulations, results, and evaluations. Finally, Section 7 presents the conclusions.

## 2. Convolution codes

A convolution code is an error-correcting code that processes information *serially*, or continuously, in short block lengths [15, 16, 17, 18, 19, 20, 21]. A convolution encoder has *memory*, in the sense that the output symbols depend not only on the current input symbols but also on previous inputs and/or outputs. In other words, the encoder is a *sequential circuit* [15, 17, 20]. A rate *R = k/n* convolution encoder with memory order *m* can be realized as a *k*-input, *n*-output linear sequential circuit with input memory order *m*; that is, inputs remain in the encoder for an additional *m* time units after entering. Typically, *n* and *k* are small integers with *k < n*, the information sequence is divided into blocks of length *k*, and the codeword is divided into blocks of length *n*. In the important special case *k* = 1, the information sequence is not divided into blocks and is processed continuously. Unlike with block codes, large minimum distances and low error probabilities are achieved not by increasing *k* and *n* but by increasing the memory order *m* [16, Chapter 11]. We consider only systematic forms of convolution codes because the normal operation of the Process block is not altered and no decoding is needed to obtain the true outputs. A systematic real convolution code guarantees that faults producing errors in the processed data will result in notable non-zero values in the syndrome sequence. Systematic encoding means that the information bits always appear in the first (leftmost) *k* positions of a codeword. The remaining (rightmost) *n* - *k* bits in a codeword are a function of the information bits and provide redundancy that can be used for error correction and/or detection. Real number convolution codes may find applications in channel coding for communication systems and in fault-tolerant data processing systems containing error correction. Real-number codes can be constructed easily from finite-field codes by viewing the field elements as corresponding integers in the real number field, and as such they theoretically have properties as good as, if not better than, the original finite-field structures [6].
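As a rough illustration of the memory and systematic structure described above, the following Python sketch encodes real-valued data with a systematic (*n*, *k* = *n* - 1, *m*) code that appends one parity sample to each block. The function names and the weight array `P_demo` are assumptions made for this example; they are not values taken from the article.

```python
import numpy as np

def systematic_parity(blocks, P):
    """Parity stream of a systematic (n, k = n-1, m) real convolution code.

    blocks: array of shape (T, k), one row of k data samples per time step.
    P:      array of shape (m + 1, k); P[i] weights the block from i steps ago.
    Returns an array of T parity samples (n - k = 1 parity per block).
    """
    blocks = np.asarray(blocks, dtype=float)
    P = np.asarray(P, dtype=float)
    T = blocks.shape[0]
    m = P.shape[0] - 1
    parity = np.zeros(T)
    for t in range(T):
        # The encoder memory: the current block and up to m previous blocks.
        for i in range(min(m, t) + 1):
            parity[t] += blocks[t - i] @ P[i]
    return parity

def encode(blocks, P):
    """Systematic codeword blocks: the k data samples followed by the parity."""
    return np.column_stack([blocks, systematic_parity(blocks, P)])

# Hypothetical weights for k = 2, m = 1 (chosen only for illustration).
P_demo = np.array([[1.0, 2.0],
                   [0.5, -1.0]])
data = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [2.0, 2.0]])
code = encode(data, P_demo)
print(code)
```

Because the code is systematic, the first *k* columns of `code` are the unmodified data, so no decoding is needed to read the outputs; only the last column carries redundancy.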

## 3. Code usage for ABFT and ABFT scheme

### 3.1. Code usage for ABFT

A real convolution code in systematic form [16] is used to compute parity values associated with the processing outputs, as shown in Figure 2. Certain classes of errors occurring anywhere in the overall system, including the parity generation and regeneration subsystems, are easily detected. A convolution code, with its encoding memory, can sense the onset of errors before they grow beyond detection limits. For a rate *k*/*n* real convolution code with constraint parameter *m*, it is always possible to extract the parity-generating part by simple linear operations. The (*n* - *k*) parity samples for each processed block of samples are produced in block processing fashion. Since processing resources are in close proximity, it is easily demonstrated [9] that an efficient block processing structure can produce the (*n* - *k*) parity values directly from the inputs. When the two comparable parity values, one computed from the outputs and the other directly from the inputs, are subtracted, only the stochastic effects remain, and the syndromes are produced as shown in Figure 2.

### 3.2. Modeling errors

Errors are modeled by adding an error vector *e* to the calculated output *y*, giving the observed output *z* = *y* + *e*.

### 3.3. ABFT scheme

To obtain the fault detection and correction properties of a convolution code in a linear process with minimum computational overhead, the architecture in Figure 2 is proposed. For error correction, redundancy must be inserted in some form, and convolution parity codes are employed within the ABFT scheme. A systematic form of convolution code is especially advantageous in the ABFT detection plan because no redundant transformations are needed to recover the processed data after the detection operations. Figure 2 summarizes an ABFT technique that employs a systematic convolution code to define the parity values. The data processing operations are combined with the parity generating function to provide one set of parity values. Here *k* is the basic block size of the input data and *n* is the block size of the output data; as new data samples are accepted, (*n* - *k*) new parity values are produced.

The upper path in Figure 2 is the processed data flow, which passes through the process block (data processing block) and then feeds the convolution encoder (parity regeneration) to produce parity values. The comparable parity values, on the other hand, are generated efficiently and directly from the inputs (parity and processing combined, see Figure 2) without producing the original outputs. The difference between the two comparable parity values, which are computed in different ways, is called the syndrome; in the absence of errors, the syndrome sequence is a stream of zero or near-zero values. The convolution code's structure is designed to produce distinct syndromes for a large class of errors appearing in the processing outputs. In this way, the scheme of Figure 2 uses convolution code parity to detect and correct processing errors.
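The two parity paths of Figure 2 can be sketched for a linear process. The example below is a minimal illustration under assumed values: the process matrix `A`, the parity weights `P`, and the helper `parity_from_blocks` are hypothetical, and the process is taken to be a fixed matrix applied to each block.

```python
import numpy as np

def parity_from_blocks(blocks, weights):
    """parity[t] = sum_i blocks[t - i] . weights[i] (one parity per block)."""
    T = blocks.shape[0]
    parity = np.zeros(T)
    for t in range(T):
        for i in range(min(len(weights) - 1, t) + 1):
            parity[t] += blocks[t - i] @ weights[i]
    return parity

rng = np.random.default_rng(0)
k, T = 4, 50
A = rng.standard_normal((k, k))            # hypothetical linear process y = A x
P = np.array([[1.0, 2.0, 3.0, 4.0],        # hypothetical parity weights, m = 2
              [0.5, -1.0, 1.5, -2.0],
              [0.25, 0.5, -0.75, 1.0]])
x = rng.standard_normal((T, k))            # input blocks

y = x @ A.T                                # upper path: process the data ...
p_u = parity_from_blocks(y, P)             # ... then regenerate parity from outputs
W = P @ A                                  # parity and processing combined
p_l = parity_from_blocks(x, W)             # lower path: parity directly from inputs

syndrome = p_l - p_u                       # only round-off remains when fault-free

z = y.copy()
z[20, 1] += 0.5                            # inject one output error: z = y + e
syndrome_faulty = p_l - parity_from_blocks(z, P)
print(np.max(np.abs(syndrome)), syndrome_faulty[20:23])
```

The combined weights follow from linearity: the parity of *A x* under weights *P*[i] equals the parity of *x* under weights *P*[i] *A*, so the lower path never has to form the outputs. With memory order 2, the single injected error shows up in three consecutive syndrome values.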

### 3.4. Error detection

Error detection computes the difference, *S*, between the two parity values and determines whether its magnitude is smaller than a chosen threshold determined by round-off error: $S = \bar{p}_{l_i} - \bar{p}_{u_i}$, and if $|S| < \tau$ then there is no error ($\tau$ is the threshold). The difference between the parity values, compared against the round-off threshold $\tau$, can thus be used to detect an error. This threshold $\tau$ places a bound on the effects of errors appearing at the output, modeled here as a vector *e* which is added to the true output *y* to characterize the observed output *z* = *y* + *e*, see Figure 3. A totally self-checking checker (comparator) for real number parities using a detection threshold is described in [9, 11]. Its role is to indicate whether an error has occurred in the process, using the parities $\bar{p}_{l_i}$ and $\bar{p}_{u_i}$. The comparator is constructed to produce a 1-out-of-2 codeword at the terminals (sign threshold, banded thresholds) = ($T_{SGN}$, $T_{\tau}$), as shown in Figure 4. Given that *S* truly represents $\bar{p}_{l_i} - \bar{p}_{u_i}$, if $|S| \ge \tau$, or if the sign unit or the value-characterization unit has failed when valid parity inputs are applied, the output will not be a valid 1-out-of-2 code. Otherwise, the comparator and its checking parts give a 1-out-of-2 code, indicating that no error has occurred in the data processing unit and its checking facilities. The precision required for the two parity values (the value characterizations in Figure 4) only needs to meet the separation by the threshold value to be effective for detection.
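The threshold test itself is simple to express. The sketch below covers only the magnitude comparison; it does not model the self-checking 1-out-of-2 comparator of Figure 4, and the function name and sample values are assumptions made for illustration.

```python
import numpy as np

def detect_errors(p_l, p_u, tau):
    """Flag positions where the syndrome S = p_l - p_u reaches the threshold tau."""
    S = np.asarray(p_l, dtype=float) - np.asarray(p_u, dtype=float)
    return np.abs(S) >= tau, S

# Hypothetical parity streams: three differ only at round-off level, one is in error.
p_l = np.array([1.0, 2.0, 3.0, 4.0])
p_u = np.array([1.0 + 1e-12, 2.0, 3.2, 4.0 - 1e-13])
flags, S = detect_errors(p_l, p_u, tau=1e-9)
print(flags)
```

Only the third position is flagged, since the other differences stay below the round-off threshold.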

## 4. Burst-error-correcting convolution codes

A burst of length *d* is defined as a vector whose non-zero components are confined to *d* consecutive digit positions, the first and last of which are non-zero [16, 17]. A burst refers to a group of possibly contiguous errors, which is characteristic of the unforeseeable effects of errors in data computation. Only systematic forms of convolution codes are considered here because the normal operation of the Process block is unchanged and no decoding is needed to obtain the true outputs. Moreover, convolution codes have good correcting characteristics because of the memory in their encoding structure [17].

### 4.1. Bounds on burst-error-correcting convolution codes

Following Costello and Lin [16], a sequence of error bits $e_{d+1}, e_{d+2}, \ldots, e_{d+a}$ is called a burst of length *a* relative to a guard space of length *b* if

1. $e_{d+1} = e_{d+a} = 1$;
2. the *b* bits preceding $e_{d+1}$ and the *b* bits following $e_{d+a}$ are all zero;
3. the *a* bits from $e_{d+1}$ through $e_{d+a}$ contain no subsequence of *b* zeros.
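The three conditions above can be checked mechanically. The following sketch is one way to do so for a finite list of error bits; the function name, the 0-based `start` index, and the choice to reject sequences too short to contain both guard spaces are assumptions of this example.

```python
def is_burst(e, start, a, b):
    """Check whether e[start:start + a] is a burst of length a relative to guard space b.

    e is a list of 0/1 error bits; start is the 0-based index of the first
    burst bit (e_{d+1} in the text).
    """
    end = start + a                          # one past the last burst bit
    if start - b < 0 or end + b > len(e):
        return False                         # not enough room for both guard spaces
    if e[start] != 1 or e[end - 1] != 1:
        return False                         # condition 1: burst starts and ends with 1
    if any(e[start - b:start]) or any(e[end:end + b]):
        return False                         # condition 2: guard spaces are error-free
    zero_run = 0
    for bit in e[start:end]:                 # condition 3: no run of b zeros inside
        zero_run = zero_run + 1 if bit == 0 else 0
        if zero_run >= b:
            return False
    return True

e = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
print(is_burst(e, 3, 4, 3))   # bits 1,0,1,1 with three zeros on each side
print(is_burst(e, 3, 4, 1))   # fails condition 3: the burst contains a run of one zero
```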

For any convolution code of rate *R* that corrects all bursts of length *a* or less relative to a guard space of length *b*,

$$\frac{b}{a} \ge \frac{1+R}{1-R}. \qquad (1)$$

If a small fraction of the bursts of length *a* is allowed to be decoded incorrectly, the guard space requirements can be reduced significantly. In particular, for a convolution code of rate *R* that corrects all but a fraction *ψ* of bursts of length *a* or less relative to a guard space of length *b*, the guard space satisfies a relaxed bound (2) that depends on *ψ*. The bound of (2) is known as the bound on *almost all* burst-error correction.

Because of their structure, burst-correcting convolution codes are appropriate and efficient for detecting and correcting errors from internal computing failures. Burst-correcting convolution codes need guard bands (error-free regions) before and after bursts of errors, particularly if error correction is needed [16]. One class of burst-correcting codes is the Berlekamp-Preparata (BP) codes [16, 17, 18, 19, 20], which have many desirable characteristics with regard to failure-error detection. Their design properties guarantee detection of the onset of errors caused by failures, regardless of any error-free region following the beginning of a burst of errors.

Consider designing an (*n*, *k* = *n* - 1, *m*) systematic convolution encoder to correct a phased burst error confined to a single block of *n* bits relative to a guard space of *m* error-free blocks. To design such a code, we must ensure that each correctable error value $[E]_m = [e_0, e_1, \ldots, e_m]$ results in a distinct syndrome $[S]_m = [s_0, s_1, \ldots, s_m]$. This implies that each error value with $e_0 \ne 0$ and $e_d = 0$, $d = 1, 2, \ldots, m$, must yield a distinct syndrome, and that each of these syndromes must be distinct from the syndrome caused by any error value with $e_0 = 0$ and a single block $e_d \ne 0$, $d = 1, 2, \ldots, m$. Therefore, the first error block $e_0$ can be decoded correctly if the first (*m* + 1) blocks of *e* contain at most one non-zero block and, assuming feedback decoding, each successive error block can be decoded in the same way. An (*n*, *k* = *n* - 1, *m*) systematic code is described by the set of generator polynomials $g_1^{(n-1)}(D), g_2^{(n-1)}(D), \ldots, g_{n-1}^{(n-1)}(D)$. The generator matrix of a systematic convolution code, *G*, is a semi-infinite matrix involving *m* finite sub-matrices as

$$G = \begin{bmatrix} I\,P_0 & 0\,P_1 & \cdots & 0\,P_m & & \\ & I\,P_0 & \cdots & 0\,P_{m-1} & 0\,P_m & \\ & & \ddots & & & \ddots \end{bmatrix} \qquad (3)$$

where *I* and 0 are the identity and all-zero *k* × *k* matrices, respectively, and $P_i$, with *i* = 0 to *m*, is a *k* × (*n* - *k*) matrix [18]. The parity-check matrix is constructed from a basic binary matrix, labeled $H_0$, a 2*n* × *n* binary matrix containing the skew-identity matrix in its top *n* rows (4).

$H_0$ is an *n* × (*m* + 1) matrix (5). For 0 < *d* ≤ *m*, we obtain $H_d$ from $H_{d-1}$ by shifting $H_{d-1}$ one column to the right and deleting the last column. Mathematically, this operation can be expressed as

$$H_d = H_{d-1} T,$$

where *T* is an (*m* + 1) × (*m* + 1) shifting matrix. Another important type of parity-check matrix is put together using $H_0$ and its *d* successive downward-shifted versions [19]. However, all the information needed to form the systematic parity-check matrix $H^T$ is contained in the basis matrix $H_0$. The lower triangular part of this matrix, with (*n* - 1) rows and (*n* - 1) columns, holds binary values selected by a construction method to produce desirable detection and correction properties [19]. For systematic codes, the parity-check submatrices $H_m$ in (4) have special forms that control how these equations are formed, where $I_{n-k}$ and $0_{n-k}$ are the identity and all-zero *k* × *k* matrices, respectively, and $P_i$ is an (*n* - 1) × *k* matrix. In an alternate view, the respective rows of $H_0$ contain the parity submatrices $P_i$ needed in $H^T$; see (4) and (7).
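The effect of the shifting matrix can be checked numerically. The sketch below assumes *T* has ones on its superdiagonal, which is one matrix that realizes the shift-right-and-delete-last-column operation described above; the small integer matrix `H0_demo` is arbitrary and is not a code design.

```python
import numpy as np

def shift_matrix(size):
    """Square matrix with ones on the superdiagonal: H @ T moves the columns of H one place right."""
    return np.eye(size, k=1, dtype=int)

def shifted(H0, d):
    """H_d = H_0 T^d, obtained by applying the one-column shift d times."""
    T = shift_matrix(H0.shape[1])
    return H0 @ np.linalg.matrix_power(T, d)

H0_demo = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8]])           # arbitrary 2 x 4 example
H1 = shifted(H0_demo, 1)
H2 = shifted(H0_demo, 2)
print(H1)
print(H2)
```

After *d* shifts the first *d* columns are all zero, which is the property the later argument about syndromes of delayed error blocks relies on.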

The *n* columns of $H_0$ are designed as an *n*-dimensional subspace of a full (2*n*)-dimensional space comparable with the size of the row space. Using this notation, the syndrome $[S]_m$ is a syndrome vector with (*l* + 1) values; in this class of codes, (*n* - *k*) equals 1. The design properties of this class of codes ensure that any contribution of errors in one observed vector, $[E]_m$, appearing in the syndrome vector $[S]_m$ is linearly independent of the syndromes caused by ensuing error vectors $[E]_{i+1}, [E]_{i+2}, \ldots, [E]_{i+l}$ in adjacent observed vectors. If, at any time, a single burst of errors is limited to the set $[E]_m$, correction is possible by separating the error effects. These errors in $[E]_m$ are identified from the top *n* items in $[S]_m$.

For errors in the ensuing vectors $[E]_{i+1}, [E]_{i+2}, \ldots, [E]_{i+l}$, the accumulated contribution lies in a separate subspace, never permitting the syndrome vector $[S]_m$ to be all zeros. The onset of errors, even if they overwhelm the correcting capability of the code, can therefore be detected. This distinction between correctable and only detectable error bursts is achieved by applying an annihilating matrix, denoted $F_0^T$, which is *n* × 2*n* and has the defining property $F_0^T H_0 = 0_n$. Hence, it is possible to check whether a syndrome vector $[S]_m$ represents correctable errors: if $F_0^T [S]_m = 0$, then $[S]_m$ corresponds to a correctable error pattern.

From (1), for an optimum burst-error-correcting code, $b/a = (1+R)/(1-R)$. For the preceding case, with $R = (n-1)/n$ and $b = m \cdot n = m \cdot a$, this implies that

$$m = \frac{b}{a} = \frac{1+R}{1-R} = 2n - 1,$$

that is, $H_0$ is an *n* × 2*n* matrix. We must choose $H_0$ such that the conditions for burst-error correction are satisfied. If we choose the first *n* columns of $H_0$ to be the skewed *n* × *n* identity matrix, then (9) implies that each error sequence with $e_0 \ne 0$ and $e_d = 0$, $d = 1, 2, \ldots, m$, will yield a distinct syndrome. In this case, we obtain the estimate of $e_0$ simply by reversing the first *n* bits in the 2*n*-bit syndrome. In addition, for each $e_0 \ne 0$, the condition

$$e_0 H_0 \ne e_d H_0 T^d, \quad 1 \le d \le m, \qquad (13)$$

must hold for all $e_d \ne 0$. This ensures that an error in some other block will not be confused with an error in block zero. For any $e_d \ne 0$ and $d \ge n$, the first *n* positions in the vector $e_d H_0 T^d$ must be zero, since $T^d$ shifts $H_0$ such that $H_0 T^d$ has all zeros in its first *d* columns; however, for any $e_0 \ne 0$, the vector $e_0 H_0$ cannot have all zeros in its first *n* positions. Hence, condition (13) is automatically satisfied for $n \le d \le m$, $m = 2n - 1$, and we replace (13) with the condition that, for each $e_0 \ne 0$,

$$e_0 H_0 \ne e_d H_0 T^d, \quad 1 \le d \le n - 1,$$

for all $e_d \ne 0$.
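The guard-space arithmetic above can be reproduced with exact fractions. The helper names in this sketch are illustrative; the only formula used is the ratio (1 + *R*)/(1 - *R*) from bound (1), evaluated at *R* = (*n* - 1)/*n*.

```python
from fractions import Fraction

def min_guard_space_ratio(R):
    """Lower bound (1 + R) / (1 - R) on b/a from bound (1)."""
    R = Fraction(R)
    return (1 + R) / (1 - R)

def memory_for_single_block_bursts(n):
    """m = b/a for rate R = (n - 1)/n when bound (1) holds with equality."""
    return min_guard_space_ratio(Fraction(n - 1, n))

for n in (2, 3, 6):
    print(n, memory_for_single_block_bursts(n))
```

For *n* = 6 this gives *m* = 11, which matches the parameters of the (6, 5, 11) code used in Section 6.3.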

## 5. Decoding and corrector system

For a correctable error pattern, the syndrome is a combination of the rows of the *n* × 2*n* matrix $H_0$. Hence, for $e_0 \ne 0$ and $e_d = 0$, $d = 1, 2, \ldots, m$, $[S]_m$ is a codeword in the (2*n*, *n*) block code generated by $H_0$; however, if $e_0 = 0$ and a single block $e_d \ne 0$ for some *d*, 1 ≤ *d* ≤ *m*, condition (13) ensures that $[S]_m$ is not a codeword in the block code generated by $H_0$. Therefore, $e_0$ contains a correctable error pattern if and only if $[S]_m$ is a codeword in the block code generated by $H_0$. This requires determining whether $[S]_m H_0^T = 0$, where $H_0^T$ is the *n* × 2*n* block code parity-check matrix corresponding to $H_0$. If $[S]_m H_0^T = 0$, the decoder must then find the correctable error pattern that produced the syndrome $[S]_m$. Because in this case $[S]_m = e_0 H_0$, we obtain the estimate of $e_0$ simply by reversing the first *n* bits in $[S]_m$. For a feedback decoder, the syndrome must then be modified to remove the effect of $e_0$. But, for a correctable error pattern, $[S]_m = e_0 H_0$ depends only on $e_0$, and hence when the effect of $e_0$ is removed the syndrome is reset to all zeros.

Figure 5 provides a more detailed view of the error correction system, one of the subassemblies in Figure 2. The processed data $\bar{d}_i$ can include errors $\bar{e}_i$, and the error correction system subtracts their estimates $\bar{e}_i'$, as indicated at its corrected data output. If one of the computed parity values, $\bar{p}_{u_i}$ or $\bar{p}_{l_i}$ in Figure 5, comes from a failed subsystem, the error correction system's inputs may be incorrect. Since the data are correct under the single-failed-subsystem assumption, the data contain no errors and the error correction system itself is operating correctly. The error correction system will observe the errors in the syndromes and properly estimate them as limited to other positions. In addition, an excessive number of error estimates $\{\bar{e}_i'\}$ could be subtracted from correct data, yielding values $\{\bar{d}_i - \bar{e}_i'\}$ at the error correction system's output, from which the regeneration of parity values produces $\{\bar{p}_{u_i}'\}$. There are several indicators that will detect errors in the error correction system's input syndromes $\{\bar{s}_i\}$.
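The decoding steps described above (reverse the first *n* syndrome values, remove the estimated effect, and check that the syndrome resets to zero) can be sketched over the real numbers. This is a simplified illustration, not the article's decoder: the right half `Q` of `H0` is a hypothetical matrix chosen for the example and is not a BP code design, and the tolerance `tol` is an assumed round-off allowance.

```python
import numpy as np

def correct_block(S, H0, tol=1e-9):
    """Try to explain the syndrome S (length 2n) by a single error block e0.

    Assumes the first n columns of H0 form the skew identity, so the first n
    syndrome values are e0 in reversed order. Returns (e0_hat, residual,
    correctable), where residual is the syndrome after removing e0_hat.
    """
    n = H0.shape[0]
    e0_hat = S[:n][::-1]                     # reverse the first n syndrome values
    residual = S - e0_hat @ H0               # feedback: remove the effect of e0_hat
    correctable = bool(np.max(np.abs(residual)) < tol)
    return e0_hat, residual, correctable

n = 3
J = np.fliplr(np.eye(n))                     # skew identity
Q = np.array([[1.0, 2.0, 0.0],               # hypothetical right half, NOT a BP design
              [0.0, 1.0, 3.0],
              [2.0, 0.0, 1.0]])
H0 = np.hstack([J, Q])                       # n x 2n

e0 = np.array([0.0, 0.7, -0.2])              # error confined to one block
S = e0 @ H0
e0_hat, residual, ok = correct_block(S, H0)

d_observed = np.array([1.0, 2.0, 3.0]) + e0  # processed data with the error added
d_corrected = d_observed - e0_hat            # subtract the error estimate
print(ok, d_corrected)
```

A syndrome that is not of the form *e*<sub>0</sub>*H*<sub>0</sub> leaves a non-zero residual, so `correctable` comes back false and the block is only flagged, not corrected.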

## 6. Simulations and results

### 6.1. Design evaluation

Error detection compares the syndrome against a threshold $\tau$: with $S = \bar{p}_{l_i} - \bar{p}_{u_i}$, if $|\bar{p}_{l_i} - \bar{p}_{u_i}| < \tau$ then there are no errors. If the threshold $\tau$ is set too low, even occasional round-off errors will exceed it, indicating failures and leading to unnecessary recomputation. It is generally permissible to accept a few small errors that are in the range of round-off levels. Nevertheless, the simulations examine how the threshold choice impacts undetected errors. Errors are detected by examining the magnitude of the respective syndromes and comparing it against thresholds set to five times the standard deviation of the syndrome values when only low levels of round-off error appear. The simulation program randomly selects the line on which an error magnitude is superimposed. The magnitude of each error is chosen from a Gaussian population with zero mean and fixed variance. For small thresholds, large errors always lead to detection, whereas large thresholds increase the probability of undetected errors. The threshold was varied over a wide range to show the transition between low and high levels of missed errors. In a simulation, however, the error-detecting capabilities are interrelated with the variance of the simulated computer-induced errors. The probability of undetected errors when errors are present is therefore evaluated as the ratio of threshold to error variance is varied over several orders of magnitude. The results are shown in Figure 6. The input data size is *k* = 100 samples. The error magnitude variance is taken as $10^{-3}$ so that, probabilistically, only small errors are superimposed. At very low thresholds, the experimental probability of undetected errors is zero, so these values are not displayed on the smallest part of the abscissa. The curves shown in Figure 6 have no undetected errors until the threshold reaches 5, where the first undetected error probability is $1.1 \times 10^{-4}$. Two longer simulations using $10^{6}$ samples are performed for two low thresholds, $2 \times 10^{-3}$ and $2 \times 10^{-5}$. The undetected error rate is $4.86 \times 10^{-7}$ when the threshold is $2 \times 10^{-5}$. For the higher threshold of $2 \times 10^{-3}$, this error rate is $4.724 \times 10^{-5}$.

Comparing the differences between the two parity values $\bar{p}_{u_i}$ and $\bar{p}_{l_i}$ shows how the checking system responds to errors.
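The dependence of missed errors on the threshold can be explored with a small Monte Carlo experiment. The sketch below is a simplified stand-in for the simulations described above, not a reproduction of them: it assumes each injected Gaussian error reaches the syndrome with unit gain, and it models round-off as additive Gaussian noise with an assumed standard deviation `roundoff_std`.

```python
import numpy as np

def undetected_rate(tau, error_var, trials=20000, roundoff_std=1e-8, seed=0):
    """Fraction of injected errors whose syndrome magnitude stays below tau."""
    rng = np.random.default_rng(seed)
    errors = rng.normal(0.0, np.sqrt(error_var), trials)   # injected error magnitudes
    roundoff = rng.normal(0.0, roundoff_std, trials)       # stand-in for round-off noise
    syndrome = errors + roundoff       # assumes the error reaches the parity with unit gain
    return float(np.mean(np.abs(syndrome) < tau))

error_var = 1e-3
ratios = [1e-4, 1e-2, 1.0, 1e2]        # threshold / error variance
rates = [undetected_rate(r * error_var, error_var) for r in ratios]
for r, rate in zip(ratios, rates):
    print(f"threshold/variance = {r:g}: undetected fraction = {rate:.4f}")
```

Because every call reuses the same seed, the same error samples are tested against each threshold, so the undetected fraction can only grow as the threshold is raised.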

### 6.2. Mean square error performance

[...] $10^{-3}$ to $10^{-8}$. The insertion error rate is $p = 5 \times 10^{-3}$. The average MSE plots in Figure 9 display the values for the input errors as well as those for the corrected data. The mean-squared values of the input errors are very similar, by statistical regularity, while the corrected MSEs are much lower since large errors have been eliminated. Furthermore, the code seems quite capable of correcting all errors. The difference between the input error mean-squared value and its corrected version can be evaluated by taking the ratio of their mean-squared levels.
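The ratio of mean-squared levels mentioned above is straightforward to compute. In this sketch the function names and the sample error vectors are assumptions made for illustration; the residual errors must not all be zero, otherwise the ratio is undefined.

```python
import numpy as np

def mse(x):
    """Mean of the squared values of x."""
    return float(np.mean(np.square(x)))

def mse_ratio(input_errors, residual_errors):
    """Input-error MSE divided by the MSE remaining after correction."""
    return mse(input_errors) / mse(residual_errors)

# Hypothetical errors before and after correction.
before = [0.1, 0.0, 0.0, -0.1]
after = [0.001, 0.0, 0.0, -0.001]
print(mse_ratio(before, after))
```

A ratio well above 1 indicates that correction removed most of the error energy.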

### 6.3. Examples and simulations

A BP burst-correcting convolution code (6, 5, 11) is constructed [16] for use in a fault-tolerant processing situation. A rate 1/3 (3, 1, 10) code, which has a constraint parameter *m* = 10, is chosen from a standard text [16]. Long simulations involving 250,000 blocks of data over a wide range of variances are performed. For the rate 1/3 code, this represents 750,000 samples, while for the rate 5/6 code it implies 1.5 million samples. Both bursts and errors within each block are permitted. A burst in this context means that the standard deviations of all components in a block are raised to 10% of the maximum standard deviation. When a burst is not active, on the other hand, errors are allowed at positions within a block chosen independently at random, and the selected positions have their standard deviations raised to 10% of their maxima. The probability of a burst is $5 \times 10^{-3}$, while intra-block errors have probability $10^{-3}$. For the long simulations, the basic parameter $\sigma^2$ (variance of error) is varied from $10^{-9}$ up to 3.2.
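One possible reading of this error-insertion model is sketched below. It is an interpretation, not the article's MATLAB program: the function name is hypothetical, a burst is taken to place a zero-mean Gaussian error of standard deviation `sigma` on every component of a block, and isolated errors use the same `sigma` at independently chosen positions.

```python
import numpy as np

def inject_errors(num_blocks, n, sigma, p_burst=5e-3, p_intra=1e-3, seed=0):
    """Return (errors, burst): an error array of shape (num_blocks, n) and the burst mask."""
    rng = np.random.default_rng(seed)
    burst = rng.random(num_blocks) < p_burst             # blocks hit by a burst
    isolated = rng.random((num_blocks, n)) < p_intra     # independently chosen positions
    active = burst[:, None] | isolated                   # a burst covers the whole block
    errors = np.where(active, rng.normal(0.0, sigma, (num_blocks, n)), 0.0)
    return errors, burst

errors, burst = inject_errors(250_000, 6, sigma=0.1)     # sizes as for the rate 5/6 case
print(burst.mean(), (errors[~burst] != 0).mean())
```

The two printed fractions should be close to the burst probability and the intra-block error probability, respectively.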

## 7. Conclusions

This article addresses new methods for performing error correction when real convolution codes are involved. Real convolution codes can provide effective protection for data processing operations at the data-parity level. Data processing implementations are protected against both hard and soft errors. The data processing system is protected through parity sequences specified by a high-rate real convolution code. Parity comparisons provide error detection, while output data correction is effected by a decoding method that accounts for both round-off error and computer-induced errors. The error detection structures developed here not only detect subsystem errors but also correct errors introduced in the data processing system. Concurrent parity value techniques are very useful in detecting numerical errors in data processing operations, where a single error can propagate to many output errors. Parity values are the most effective tools used to detect burst errors occurring in the code stream. The detection performance in the data processing system depends on the detection threshold, which is determined by round-off tolerances. The structures have been tested using MATLAB programs that compute the error-detecting performance of the concurrent parity values method, and simulation results are presented.

## Notes

### Acknowledgements

The authors are grateful to Mrs. Mahbobeh Meshkinfam for comments that significantly improved the quality of this article.

## References

- 1. Huang KH, Abraham JA: **Algorithm-based fault tolerance for matrix operations.** *IEEE Trans Comput* 1984, **C-33:** 518-528.
- 2. Jou JY, Abraham JA: **Fault tolerant matrix arithmetic and signal processing on highly concurrent computing structures.** *Proc IEEE* 1986, **74**(5): 732-741.
- 3. Jou JY, Abraham JA: **Fault-tolerant FFT networks.** *IEEE Trans Comput* 1988, **37:** 548-561. 10.1109/12.4606
- 4. Banerjee P, Rahmeh JT, Stunkel CB, Nair VSS, Roy K, Abraham JA: **Algorithm-based fault tolerance on a hypercube multiprocessor.** *IEEE Trans Comput* 1990, **39:** 1132-1145. 10.1109/12.57055
- 5. Rexford J, Jha NK: **Algorithm-based fault tolerance for floating-point operations in massively parallel systems.** *Proc Int Symp on Circuits & Systems* 1992, 649-652.
- 6. Nair VSS, Abraham JA: **Real number codes for fault-tolerant matrix operations on processor arrays.** *IEEE Trans Comput* 1990, **39:** 426-435. 10.1109/12.54836
- 7. Bosilca G, Delmas R, Dongarra J, Langou J: **Algorithm-based fault tolerance applied to high performance computing.** *J Parallel Distrib Comput* 2009, **69**(4): 410-416. 10.1016/j.jpdc.2008.12.002
- 8. Roche T, Cunche M, Roch JL: **Algorithm-based fault tolerance applied to P2P computing networks.** *ap2ps, 2009 First International Conference on Advances in P2P Systems* 2009, 144-149.
- 9. Redinbo GR: **Generalized algorithm-based fault tolerance: error correction via Kalman estimation.** *IEEE Trans Comput* 1998, **47**(6): 1864-1876.
- 10. Redinbo GR: **Failure-detecting arithmetic convolution codes and an iterative correcting strategy.** *IEEE Trans Comput* 2003, **52**(11): 1434-1442. 10.1109/TC.2003.1244941
- 11. Redinbo GR: **Wavelet codes for algorithm-based fault tolerance applications.** *IEEE Trans Depend Secure Comput* 2010, **7**(3): 315-328.
- 12. Redinbo GR: **Systematic wavelet Sub codes for data protection.** *IEEE Trans Comput* 2011, **60**(6): 904-909.
- 13. Moosavie Nia A, Mohammadi K: **A generalized ABFT technique using a fault tolerant neural network.** *J Circ Syst Comput* 2007, **16**(3): 337-356. 10.1142/S0218126607003708
- 14. Baylis J: *Error-Correcting Codes: A Mathematical Introduction.* Chapman and Hall Ltd; 1998.
- 15. Veeravalli VS: **Fault tolerance for arithmetic and logic unit.** *IEEE Southeastcon 09* 2009, 329-334.
- 16. Costello D, Lin S: *Error Control Coding Fundamentals and Applications.* 2nd edition. Pearson Education Inc., NJ; 2004.
- 17. Morelos-Zaragoza RH: *The Art of Error Correcting Coding.* 2nd edition. Wiley; 2006.
- 18. Viterbi AJ, Omura JK: *Principles of Digital Communication and Coding.* 2nd edition. Mc-Grawhill; 1985.
- 19. Berlekamp ER: **A class of convolution codes.** *Inf Control* 1962, **6:** 1-13.
- 20. Massey JL: **Implementation of burst-correcting convolution codes.** *IEEE Trans Inf Theory* 1965, **11:** 416-422. 10.1109/TIT.1965.1053798
- 21. Lee LHC: *Convolutional Coding: Fundamentals and Applications.* Artech House; 1997.

## Copyright information

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.