In this chapter, three high-level reliability estimation techniques are presented which quickly characterize the effects of errors on processor architectures. In Sect. 5.1, an analytical estimation technique is presented that quantifies the vulnerability and logic masking capability of individual circuit elements while calculating instruction- and application-level error rates. In Sect. 5.2, the Probabilistic Error Masking Matrix is introduced to predict error effects through a graph network capturing dynamic processor behavior. In Sect. 5.3, a design diversity metric is illustrated to evaluate the robustness of redundant systems against common-mode failures for system-level processing components.

5.1 Analytical Reliability Estimation Technique

Complementing the simulation techniques based on fault injection, analytical techniques have also been proposed to investigate the behavior of circuits under faults. Mukherjee et al. [138] introduced the concept of architecturally correct execution (ACE) to compute the vulnerability factors of processor structures. In [21] the authors performed ACE analysis to compute architectural vulnerability factors for caches and buffers. Recently, Rehman et al. [153, 164] extended the ACE concept to instruction vulnerability analysis and proposed reliability-aware software transformations. In that work, the vulnerability of an instruction is analyzed by studying its constituent logic blocks, which can potentially be connected with circuit-level reliability analysis [162]. While the instruction vulnerability index model proposed in [153] includes logical masking effects, the details of the derivation of the masking effect are not given, and the simulation accuracy is compared only with other software-level reliability estimation flows [153].

Contribution In this work, an analytical technique is proposed to estimate the application-dependent reliability of embedded processors; its fault evaluation is benchmarked against the instruction set simulation-based fault injection technique of Sect. 4.1. Figure 5.1 shows the contributions, where the novel modules are shaded dark. The simulation-based reliability estimation technique is performed at both the RTL and ADL abstraction layers. The analytical technique takes the instruction profile of the target application and the fault simulation results at either abstraction layer as inputs. These results are used to calculate the operation fault properties and the Instruction Error Rate (IER), which are then processed by the reliability estimator to predict the Application Error Rate (AER). Users can refine the LISA models and target applications to tune the AER, which closes the reliability estimation/exploration loop.

Fig. 5.1 ADL-driven reliability estimation flow [216]. Copyright ©2013 IEEE

To present the analytical technique, the operation reliability model is explained first; it is then applied to calculate instruction error rates. Afterwards, the application error rates are derived by profiling the target applications. The exemplary analysis is carried out on the RISC processor model with five pipeline stages, which is available via [184].

5.1.1 Operation Reliability Model

A Directed Acyclic Graph (DAG) is used to represent the activation chain of LISA operations. To represent fault injection and error propagation, data flows have to be added to the DAG. Figure 5.2 shows the data flow graph for the ALU instruction. While the nodes represent LISA operations, the edges between them show the data flows, each with an individual index and the corresponding signal names. When a transient fault is injected into an operation, it first needs to manifest on the operation's output edges and then propagate through the following operations until it manifests on the output of the Writeback operation, resulting in an instruction-level error. Note that, due to logic masking effects, not all faults result in an instruction-level error. Consequently, the operation error probability and the operation masking probability are proposed to model this process.

Fig. 5.2 Data flow graph for ALU instruction [216]. Copyright ©2013 IEEE

Operation error probability \(C_{op}^{e}\) is the probability of detecting an error on the output edge e of an operation when a fault is injected inside that operation.

Operation masking probability \(M_{op}^{e\_in, e\_out}\) is the probability of detecting an error on the output edge \(e\_out\) of an operation when a fault is injected on its input edge \(e\_in\).

Each operation has both \(C_{op}^{e}\) and \(M_{op}^{e\_in, e\_out}\), representing fault injection inside it and error propagation through it, respectively. For a particular architecture model, single-bit faults are injected through disturbance signals inside each operation, randomly in time and location. By tracing the output edges and comparing the traced values with a golden simulation, \(C_{op}^{e}\) is easily obtained once a large number of simulations has been performed to counter the randomness. \(M_{op}^{e\_in, e\_out}\) is acquired in the same way by injecting faults on the input edges while the output edges are traced and compared. Purely analyzing the data flow graph of the combinational logic inside each operation, instead of simulating it, could also predict its \(C_{op}^{e}\) and \(M_{op}^{e\_in, e\_out}\) values; this is left for future work.
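As an illustration, the following minimal sketch shows how \(C_{op}^{e}\) could be estimated by such repeated fault injection; the simulator interface (`run_golden`, `run_faulty`) and the operation attributes are hypothetical stand-ins for the LISA-based fault simulator, not its actual API.

```python
import random

def estimate_error_probability(op, edge, run_golden, run_faulty, trials=3000):
    """Estimate the operation error probability C_op^e by Monte Carlo fault
    injection: flip one random bit inside `op` at a random cycle and check
    whether the value traced on `edge` deviates from the golden run."""
    mismatches = 0
    for _ in range(trials):
        cycle = random.randrange(op.duration_cycles)   # random injection time
        bit = random.randrange(op.fault_signal_width)  # random injection location
        if run_faulty(op, cycle, bit, trace=edge) != run_golden(trace=edge):
            mismatches += 1                            # error manifested on edge
    return mismatches / trials
```

\(M_{op}^{e\_in, e\_out}\) would be characterized analogously, injecting the bit flip on an input edge instead of inside the operation.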

5.1.2 Instruction Error Rate

The path error probability is the product of \(C_{op}^{e}\) of the fault-injected operation and the \(M_{op}^{e\_in, e\_out}\) of the following operations on the same path from the fault-injected operation to the sink operation. The instruction error rate \(IER_{insn}^{op\_faulty}\) for operation \(op\_faulty\) and instruction insn is defined as the sum of all path error probabilities. For example, Eq. 5.1 shows the instruction error rate when operation Fetch in Fig. 5.2 is fault-injected. The edges in the equation are labeled with their indices.

$$\begin{aligned} \begin{aligned} IER_{alu}^{fetch} =&C_{fetch}^{1}M_{decode}^{1, 2}M_{writeback}^{2, 7}+ \\&C_{fetch}^{1}M_{decode}^{1, 3}M_{alu\_ex}^{3, 6}M_{writeback}^{6, 7}+ \\&C_{fetch}^{1}M_{decode}^{1, 4}M_{alu\_dc}^{4, 5}M_{alu\_ex}^{5, 6}M_{writeback}^{6, 7} \\ \end{aligned} \end{aligned}$$
(5.1)
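A minimal sketch of this path enumeration is shown below; the encoding of paths as tuples of operations and edge indices is an assumption for illustration, not the tool's actual data structure.

```python
from math import prod

def path_error_probability(C, M, path):
    """Error probability of one path from the fault-injected operation to
    the sink. `path` starts with (op_faulty, out_edge) and continues with
    (op, in_edge, out_edge) triples, mirroring the terms of Eq. 5.1."""
    (op0, e0), rest = path[0], path[1:]
    return C[op0][e0] * prod(M[op][(ei, eo)] for op, ei, eo in rest)

def instruction_error_rate(C, M, paths):
    """IER_insn^{op_faulty} = sum of all path error probabilities."""
    return sum(path_error_probability(C, M, p) for p in paths)

# The three paths of Eq. 5.1 (edge indices as in Fig. 5.2):
paths_alu_fetch = [
    [("fetch", 1), ("decode", 1, 2), ("writeback", 2, 7)],
    [("fetch", 1), ("decode", 1, 3), ("alu_ex", 3, 6), ("writeback", 6, 7)],
    [("fetch", 1), ("decode", 1, 4), ("alu_dc", 4, 5), ("alu_ex", 5, 6),
     ("writeback", 6, 7)],
]
```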

The method above for calculating the instruction error rate can be applied to all instructions, which are defined in LISA as chains of activated operations. An instruction error can result from a fault injected into any preceding operation in the instruction data flow graph, so the error rates for a particular instruction constitute a set of \(IER_{insn}^{op\_faulty}\) values, where \(op\_faulty\) is one of the activated operations of insn. Besides the operations, the edges between them can also be faulty, which corresponds to a fault injected on storage resources such as signals and registers. Such resources have both error and masking probabilities equal to one: no masking effect exists for them, so they propagate any fault they encounter. In this work, SEU errors arising within the resources are not considered, since the analysis focuses primarily on errors arising inside combinational logic.

Fig. 5.3 Operation graph for all instructions in RISC processor [216]. Copyright ©2013 IEEE

5.1.3 Application Error Rate

The application error rate \(AER_{app}^{op\_faulty}\) represents the error probability when a fault is injected inside operation \(op\_faulty\) during the execution of a specific application app. When the error rates of all instructions are known, the application error rate is defined as the weighted average of all instruction error rates, where the weight of each instruction is its execution count divided by the total instruction count of the whole application. Figure 5.3 shows the DAG for all instructions of the RISC processor model. Several instructions with similar operand behaviors are grouped into the same operation for simplicity. Each instruction corresponds to a path starting from operation Fetch and ending at its sink operations, which interact with resources such as the register file or memories. The weights of the instructions are labeled as pi and can be acquired from the application profiler. As an example, the application error rate of the \(alu\_rrr\_ex\) operation is shown in Eq. 5.2. The summation arises because the operation is on the activation chain of the two instructions \(alu\_rrr\) and \(alu\_rrri\).

$$\begin{aligned} AER_{app}^{alu\_rrr\_ex} = p_{app}^{1}IER_{alu\_rrri}^{alu\_rrr\_ex}+p_{app}^{2}IER_{alu\_rrr}^{alu\_rrr\_ex} \end{aligned}$$
(5.2)
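A sketch of this weighted average is given below; the numeric weights and IER values are made up for illustration, not taken from the tables of this chapter.

```python
def application_error_rate(ier, weights):
    """AER for one faulty operation: weighted average of the IERs of all
    instructions whose activation chain contains that operation (Eq. 5.2).
    ier[insn]     : IER_insn^{op_faulty} for the operation under analysis
    weights[insn] : execution count of insn / total instruction count
    """
    return sum(weights[insn] * ier[insn] for insn in ier)

# Eq. 5.2 for alu_rrr_ex, with illustrative numbers:
aer = application_error_rate(
    ier={"alu_rrri": 0.42, "alu_rrr": 0.38},
    weights={"alu_rrri": 0.12, "alu_rrr": 0.07},
)
```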

An application error is detected here as a mismatch of instruction results, either values committed to register files or load/store values to memories, against the golden simulation. This provides a conservative estimate of the error rate of the program's output, which is normally the value sent by the processor through I/O instructions; an error in the current setup may not lead to an I/O error. There are several reasons for this. First, an erroneous value committed to architecture registers can be masked by subsequent instructions before I/O access. Second, affected operations that are not activated can be irrelevant to the value finally sent through I/O. Besides, the hardware bypass features of the processor can also silence an interface error, since the source of an operand can be the bypassed value from pipeline registers instead of architecture registers; in this case, an error occurring in the writeback value after it has been bypassed to later instructions may not result in an error either. Nevertheless, the proposed analysis offers a fast method to determine to what extent a fault-injected operation can potentially influence the program output, so that engineers can adopt software or hardware measures to improve system reliability.

5.1.4 Analytical Reliability Estimation for RISC Processor

In this section, the reliability analysis based on the proposed methodology is presented. First, the estimation of the IER for individual operations is shown. Next, the AER is calculated from the IERs and the application-dependent instruction weights. The estimated values are compared with experimental values.

5.1.4.1 IER

A set of testbenches is developed to obtain the individual \(IER_{insn}^{op\_faulty}\) values. Each testbench contains the same type of instruction with different modes and random operands. A single bit-flip fault with a duration of one clock cycle, targeting a specific operation, is then injected during each simulation. Mismatches are easily detected by comparing the faulty and golden simulations. Each operation-specific \(IER_{insn}^{op\_faulty}\) is obtained from 3000 simulations. The IER can also be derived analytically from Eq. 5.1, where \(C_{op}^{e}\) and \(M_{op}^{e\_in, e\_out}\) are obtained from fault simulations. Here the experimental values are used directly for higher estimation accuracy.

Table 5.1 shows the \(IER_{insn}^{op\_faulty}\) values of instruction \(alu\_rrr\) as an example. Table 5.1 also shows the application-dependent instruction weights for Sobel. The weights are used to calculate \(p\cdot IER\), which constitutes one term of the AER in Eq. 5.2. Such weights can be obtained directly from the profiling tools of Processor Designer. Note that the \(alu\_rrr\_dc\) and \(alu\_rrr\_ex\) operations are subdivided into several modes, because different modes of the same instruction type have distinct IERs and weights. The overall IER of such an operation is the weighted average of the IERs of all its modes.

Table 5.1 Instruction-level reliability estimation [216]

5.1.4.2 AER

When the \(IER_{insn}^{op\_faulty}\) values for all operations and instructions have been obtained from the testbenches, Eq. 5.2 is applied to estimate \(AER_{app}^{op\_faulty}\) based on the application profile. Table 5.2 shows the estimated values, the experimental values, and the relative deviation between them, averaged over three selected applications. In each experiment, a single bit flip with a duration of one clock cycle is injected randomly in time and location into the target operation. All analytical reliability estimation values are obtained through one single simulation, which consumes a negligible amount of time, whereas each experimental value comes from 10,000 LISA-level fault simulation experiments, which take around five hours for Sobel and FFT each and 12 hours for IDCT. This is a significant improvement in productivity and facilitates exploration by the application developer, such as the optimizations proposed in [164]. Naturally, for any change in the processor datapath or storage, the analytical model parameters need to be recomputed by benchmarking against the instruction-set simulation-based or RTL-based reliability estimation flow.

Table 5.2 Reliability estimation for selected applications [216]

Generally, for all three applications the estimated and experimental AER values of the same operation are close to each other. Regarding individual \(AER_{app}^{op\_faulty}\) values, the fetch, decode and writeback_dst operations are clearly more vulnerable than the others, since they reside on the paths of many operations. Besides, address_generation shows the highest AER among the remaining operations; this is because it is activated by load and store operations with direct access to the resources. Nop shows an error rate of zero, since it contributes nothing to the program execution. Comparing the applications, ldc_ri_dc is more vulnerable in FFT, since coefficients are loaded more frequently in FFT than in the other applications, while Sobel suffers more from faults in alu_rri_dc and alu_rri_ex, since the compiler generates more assembly code for calculations with immediate values.

Regarding estimation accuracy, the results for operations with higher AER values show better matches. This is because frequently called operations are more robust against the randomness of fault injection. Besides, the AERs of operations involving conditional behavior, such as cmp_rr and bra, depend strongly on the application characteristics, which makes them difficult to predict from IERs obtained with a standard testbench.

5.1.5 Summary

In this work, an analytical reliability estimation technique is presented which facilitates fast reliability estimation for the target processor architecture with sufficient accuracy compared to instruction-set simulation-based estimation. The estimation accuracy of both techniques is demonstrated for several embedded applications on a RISC processor by benchmarking against high-level fault injection.

5.2 Probabilistic Error Masking Matrix

The design of reliable systems in the presence of faults is a challenging problem, which requires an understanding of the causes and effects of failure mechanisms such as radiation and electromigration. Moreover, reliability trades off against other design metrics [2, 59, 91, 160]. Recent research shows that applying separate error mitigation techniques at individual design abstractions may lead to an over-protected system; it is therefore desirable to treat reliability as a cross-layer design issue [47]. For instance, architectural fault tolerance techniques should take advantage of circuit-level and algorithmic error resilience [74, 146]. However, cross-layer exploration requires clear knowledge of fault propagation through the design abstractions. Using such knowledge, error properties such as injection time, location and probability can be approximately predicted even before tedious fault injection experiments.

Approximate error prediction is particularly important for algorithmic reliability and inexact, probabilistic computing [141]. Earlier research on this topic can be traced to floating-point to fixed-point conversion for DSP design [75]; however, there the error locations are limited to variables (fixed-point word sizes) and operators (saturation and rounding effects), which neglects architectural concerns. The framework of Probabilistic Transfer Matrices (PTM) proposed by Krishnaswamy [162] captures the probabilistic behavior of a circuit to estimate the error probability inside it. However, PTM suffers from scalability issues for large designs due to its single-bit granularity. In [131] a statistical error tracking approach named RAVEN is introduced to analyze cross-layer error effects; it predicts the DUE (Detected Unrecoverable Error) and SDC (Silent Data Corruption) outcomes of soft errors. However, RAVEN analyzes error propagation through large micro-architecture blocks, such as a whole pipeline stage, using averaged masking statistics, which over-estimates the number of errors because the various logic masking effects depend on runtime processor behavior.

Contribution In this work, a novel algebraic representation called the Probabilistic error Masking Matrix (PeMM) is proposed to model the masking effects on errors occurring at the inputs of a circuit. In contrast to the high computational complexity of PTMs, PeMM requires very little computation, since its native granularity is the signal level. Fine-grained PeMMs are also designed to calculate nibble-wise or byte-wise error probabilities. Furthermore, the PeMM algebra has been integrated into the LISA-based high-level processor design framework, where logic errors are represented by an abstract token data structure. An automated analysis flow predicts the token propagation using a cycle-accurate instruction set simulator, while PeMMs model the error masking effects of the micro-architecture components. Several optimization techniques are introduced to increase the prediction accuracy, which depends heavily on the control states of the architecture.

5.2.1 Logic Masking in Digital Circuits

Faults within logic circuits are masked with a certain probability before propagating as output errors. Such masking effects are caused by:

  • Logic primitives containing arithmetic operators, which have inherent error masking abilities.

  • Micro-architecture features that can ignore erroneous data, such as data bypassing and branch prediction.

  • Architecture resources such as registers and memory elements, whose erroneous contents may never be used or may be overwritten before being read.

PTM [162] calculates the error probability of the outputs for faults inside the circuit (Fig. 5.4a). It suffers from a scalability problem, since the PTM has a matrix size of \(2^n \times 2^m\), where n and m are the total number of input and output bits. Deriving the PTM of a large design requires composing the PTMs of the individual logic gates, which is infeasible for modern VLSI.

Fig. 5.4 Faults in logic circuits [207]. Copyright ©2015 IEEE

The Probabilistic error Masking Matrix (PeMM) primarily handles the case in Fig. 5.4b, where the faults are located at the inputs of the circuit. A PeMM has a matrix size of only \(m \times n\) for a circuit with n input bits and m output bits. The PeMM can be compressed further when n and m represent the number of input and output signals.

5.2.1.1 PeMM Definition

For a circuit with n inputs and m outputs, labelled \(in_0, \ldots, in_{n-1}\) and \(out_0, \ldots, out_{m-1}\) respectively, the PeMM P of the circuit has dimension \(m \times n\). Each element \(p(out_i, in_j)\) indicates the error probability on output \(out_i\) given a \(100\%\) error on input \(in_j\), where \(i\in [0, m-1]\) and \(j\in [0, n-1]\). \(p(out_i, in_j) = 0\) represents complete error masking, while 1 implies no masking at all. The input vector has dimension \(n \times 1\), with element I(j) representing the error probability \(e_{in_j}\) on input \(in_j\). The output vector has dimension \(m \times 1\), with elements giving the error probabilities \(e_{out_i}\) on the outputs; \(e_{out_i}\in [0, 1]\) implies truncation of an error probability when it exceeds one. Figure 5.5 visualizes the PeMM for an abstract circuit model.
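A minimal numeric sketch of this definition, assuming a signal-level PeMM with two inputs and one output (the matrix values are illustrative, not characterized from a real circuit):

```python
import numpy as np

def apply_pemm(P, e_in):
    """Propagate input error probabilities through a PeMM.
    P    : (m x n) masking matrix, P[i, j] = p(out_i, in_j)
    e_in : (n,) vector of input error probabilities
    Returns the (m,) vector of output error probabilities,
    truncated to [0, 1] as required by the definition."""
    return np.clip(P @ e_in, 0.0, 1.0)

# Illustrative 1x2 PeMM for a two-input arithmetic block:
P = np.array([[0.8, 0.8]])
print(apply_pemm(P, np.array([1.0, 0.0])))  # error only on in_0 -> [0.8]
```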

Fig. 5.5 Probabilistic error Masking Matrix (PeMM) [207]. Copyright ©2015 IEEE

PeMMs characterize the error masking effects of micro-architecture components. The PeMM of a circuit is evaluated as the concatenation of the PeMMs of its sub-components. The architecture components involved in the ALU instructions and the data signals between them are shown in Fig. 5.6. The dimensions of the component-wise PeMMs are labeled in bold. The propagated tokens are indicated by the round red dots, which represent erroneous data with associated probabilities.

Fig. 5.6 Logic blocks involved in ALU instruction [207]. Copyright ©2015 IEEE

5.2.2 PeMM for Processor Building Blocks

5.2.2.1 Combinational Logic Blocks

PeMM models the masking effect of a circuit as a linear transformation. However, this approach does not handle logic blocks with internal data dependencies. One solution is to decompose larger circuits into logic sub-blocks with individual PeMMs, according to their data dependencies. Figure 5.7 shows the PeMM decomposition of the large logic block alu_ex into 3 sub-blocks. Signals alu_in1 and alu_in2 connect alu_ex_1 and alu_ex_2, while alu_out connects alu_ex_2 and alu_ex_3. Following this approach, the PeMMs of sub-blocks without internal data dependencies can be characterized individually. An intra-token pool keeps the temporary tokens for further processing inside large logic blocks; it reflects the fact that such tokens cannot be accessed by other logic blocks.
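Under this decomposition, the output error probabilities are obtained by chaining the sub-block PeMMs; a sketch, reusing the truncation rule from the definition above (the ordering and dimensions of the matrices are assumptions for illustration):

```python
import numpy as np

def compose_pemms(pemms, e_in):
    """Propagate an input error vector through a chain of sub-block PeMMs
    (e.g. alu_ex_1 -> alu_ex_2 -> alu_ex_3). The intermediate vectors play
    the role of the temporary tokens held in the intra-token pool."""
    e = np.asarray(e_in, dtype=float)
    for P in pemms:
        e = np.clip(P @ e, 0.0, 1.0)  # truncate after every sub-block
    return e
```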

Fig. 5.7 Decomposition of large logic block using PeMM [207]. Copyright ©2015 IEEE

5.2.2.2 Control Flow Inside Logic Block

Non-linear operators inside a logic block, such as the multiplexers generated from control flow, reduce the prediction accuracy of PeMM: the real masking probability differs significantly from the randomly characterized one. For instance, the circuit shown in Fig. 5.8 contains a 3-to-1 multiplexer generated from conditional statements. Random characterization of the highlighted PeMM elements gives the value \([0.33\ 0.33\ 0.33]\), which misrepresents the real masking due to the exclusiveness of the multiplexer. To solve this, additional helper_signals are declared to indicate the dynamically active branch and to fill in the correct PeMM elements. For example, the vector \([1\ 0\ 0]\) is filled in when the first branch of the if statement is active, correctly expressing that an error from the first branch propagates directly to the output while errors on the other branches are masked completely.
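A sketch of this dynamic correction, where the helper signal selects a one-hot masking row at simulation time (the function and its arguments are illustrative, not the framework's API):

```python
import numpy as np

def mux_pemm_row(active_branch, num_branches=3):
    """PeMM row for a multiplexer output: only the dynamically active
    branch propagates its error unmasked; the others are fully masked."""
    row = np.zeros(num_branches)
    row[active_branch] = 1.0
    return row

print(mux_pemm_row(0))  # first if-branch active -> [1. 0. 0.]
```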

Fig. 5.8 Control flow handling for PeMM [207]. Copyright ©2015 IEEE

5.2.2.3 Sequential Logic and Memory

Sequential logic (register file and pipeline registers) and memory blocks show no logic masking effect on their inputs; their PeMMs can be modeled directly by the identity matrix \(I_{m \times m}\), where m is the number of inputs and outputs. For pipeline registers, input errors are propagated to the outputs during the pipeline shift. For the register file, input errors are stored during write accesses and loaded during reads. Similarly, the PeMM of a memory is modeled by an identity matrix with m equal to the number of storage cells. Note that sequential logic has strong timing masking effects, such as timing errors caused by setup/hold violations. This factor is currently not contained in the behavioral PeMM and will be integrated in future work.

5.2.2.4 Inputs with Multiple Faults

Multiple errors on the PeMM inputs also affect its accuracy. Matrix multiplication with the input vector accumulates the contributions of all input errors, which achieves good masking accuracy for most arithmetic operators. However, correlated input errors, partially or completely generated from the same error, can cancel each other's effects, depending on the arithmetic operator. For instance, a strong error cancellation effect exists for the XOR operator with bit-flip errors at the same bit position of both inputs. Ideally, a separate set of PeMMs should be adopted for multiple input errors, at the cost of additional modeling effort. However, since this case is relatively rare, the PeMMs characterized for single input errors are still applied, giving a worst-case estimate.

5.2.3 PeMM Characterization

Statistical simulation is used to characterize the PeMM elements by injecting errors at the primary inputs of the logic blocks. Testbenches in a high-level language such as C embed the behavioral description of the circuit. The probability \(M^{out_i}_{in_j}\) is acquired by averaging the error probability on \(out_i\) over multiple experiments in which a random single bit-flip error is injected on input \(in_j\).

5.2.3.1 Accuracy of PeMM Characterization

To characterize the PeMM elements with the desired confidence level, the number of random experiments is determined according to [42] by randomizing the input values and the bit positions of the errors. For a circuit under test with n inputs of m bits each, the space of input randomness has size \(2^{m \times n}\); with a random error bit position, the overall experiment space has size \(2^{m \times n} \times m\). For example, for a circuit with 2 inputs of 32 bits each, 9604 experiments are needed to produce a PeMM element at a 95% confidence level with a confidence interval of 1%.
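This sample size is consistent with the standard worst-case formula \(N = z^2 p(1-p)/E^2\) with \(p = 0.5\); a quick check, assuming that this is indeed the formula behind [42]:

```python
import math

def sample_size(z=1.96, margin=0.01, p=0.5):
    """Worst-case number of random experiments for estimating a
    probability within +/- margin at the given confidence level
    (z = 1.96 for 95% confidence)."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size())  # 9604, matching the example above
```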

5.2.3.2 Fine-Grained PeMM

To trade off prediction accuracy against modeling complexity, PeMM can be extended to model errors at finer granularities, such as the byte or nibble level. Then not only the existence of errors on a signal can be predicted, but also the error distribution across the bits of the signal. This can be important for prototyping algorithms and architectures for approximate computing.

A fine-grained PeMM can be created using an additional look-up table for the values of \(M^{out_i}_{in_j}\), as in Table 5.3, where byte-level masking probabilities for selected operations are listed. The first column gives the targeted operation, while the second column forms a Key variable showing in which bytes the faults are located for the two inputs of the logic primitive. For instance, key 13 denotes faults in the \(1^{st}\) byte of the first input and the \(3^{rd}\) byte of the second input, while key 10 denotes a fault only in the \(1^{st}\) byte of the first input and none in the second. The byte-wise \(M^{out_i}_{in_j}\) gives the probability of an error existing in a particular output byte. Depending on the targeted field of application, the granularity can be refined further, which requires additional characterization effort. For example, a single input fault in the \(1^{st}\) byte of a SUB operation can result in errors in the \(2^{nd}\) or even \(3^{rd}\) byte with reduced probability, whereas for an AND operation no cross-byte error can result from a single input fault. When faults exist in multiple bytes of the same input, the expected masking probabilities can be interpolated from the byte-level error probabilities.
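A sketch of such a byte-level lookup follows; the table fragment and its probabilities are placeholders for illustration, not the characterized values of Table 5.3.

```python
# Byte-level masking LUT: (operation, key) -> per-output-byte error
# probabilities. Key "13" = fault in byte 1 of input 1 and byte 3 of
# input 2; "10" = fault only in byte 1 of input 1 ("0" = no fault).
BYTE_PEMM = {
    ("SUB", "10"): [0.95, 0.30, 0.05, 0.00],  # carries spill into bytes 2-3
    ("AND", "10"): [0.50, 0.00, 0.00, 0.00],  # no cross-byte propagation
}

def byte_masking(op, fault_byte_in1, fault_byte_in2):
    key = f"{fault_byte_in1}{fault_byte_in2}"
    return BYTE_PEMM[(op, key)]
```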

Table 5.3 Examples of PeMM elements with byte-level granularity [207]

Figures 5.9 and 5.10 show examples of byte-level and nibble-level PeMMs. Each element of the word-level PeMM is expanded into a \(4 \times 4\) sub-matrix in the byte-level PeMM and an \(8 \times 8\) sub-matrix in the nibble-level PeMM. The indexing label out/in denotes the sub-matrix relating input signal in to output signal out. The overall error probability on a specific segment of signal out is the sum of the error contributions propagated through all sub-matrices sharing that output signal and segment. Taking the element alu_out/alu_in1 as an example, once a fault is injected into a single segment, the error expands into the neighboring segments with reduced error probabilities. The nibble-level PeMM shows this cross-segment error propagation even more clearly, since mismatches are characterized on finer segments.

Fig. 5.9 Byte-level PeMM

Fig. 5.10 Nibble-level PeMM

5.2.4 Approximate Error Prediction Framework

The PeMM-based algebraic operations are integrated with the LISA-based processor design flow [184] to establish an approximate error prediction framework for generic architectures. Simulators based on other description languages, such as Verilog and SystemC, can also take advantage of this technique. Figure 5.11 presents an overview of the framework.

Fig. 5.11 Error tracking and prediction framework [207]. Copyright ©2015 IEEE

The flow is composed of a preparatory stage and an execution stage. In the preparatory stage, a cycle-accurate instruction-set simulator (ISS) is generated from the processor description in the ADL LISA [1] together with the user-provided applications. The simulator is extended with the fault injection technique of Sect. 4.1. An additional parser of the LISA source code extracts the behavior sections of the LISA operations and the input and output resources of the individual architecture units. The PeMM characterization module wraps the behavior of the processor architecture components into C-based testbenches with the interface signals as function arguments; the PeMMs are then quickly characterized in these testbenches with random inputs and faults. The LISA parser supports language pragmas for the extended PeMM characterization described in Sects. 5.2.2.1, 5.2.2.2 and 5.2.3.2.

In the execution stage, the user injects tokens through a graphical interface or an XML description file. The token data structure carries the error probability, along with elements representing the micro-architectural location and timing, which are required to track the token during propagation. The PeMM algebra is invoked by active logic units to calculate output error probabilities, while inactive logic units completely mask their input tokens. The final report contains the errors predicted by the end of the simulation, as well as the detailed paths of token propagation and the error masking conditions.

5.2.4.1 Error Representation

In contrast to faults and errors, token injection does not alter resource values but annotates them with an error probability, which is initially set to one. A token is removed when its error probability is masked down to 0. To fetch the correct token, the hardware resource ID and array index are updated together with the error probabilities. Specific hardware resources can contain multiple sub-tokens; for instance, the instruction register contains sub-tokens in each of its decoding fields, such as opcode, source and destination operands.

5.2.4.2 Token Tracking

Since the tokens inject no actual errors, the simulator continues to execute correctly while indicating potential errors. Algorithm 1 describes the token tracker, which is called at each clock cycle. The algorithm begins by checking the activation of the LISA operations. If any activated operation has inputs containing tokens, PeMMs are applied to update the tokens and propagate them to the outputs. Due to the synchronized hardware behavior, token creation and removal are scheduled for the end of the cycle. Besides the activation analysis of operations, the tokens in pipeline registers are forwarded to the next pipeline stage; however, forwarded tokens are overwritten by tokens created by active operations, if there are any. Old tokens in memories and register files are replaced by new ones if they are not read out before being overwritten.
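A compact sketch along the lines of Algorithm 1 is given below; the operation, register and token-pool interfaces (`is_activated`, `pool.probability`, `pool.update`, etc.) are assumptions for illustration, not the framework's actual API.

```python
import numpy as np

def track_tokens_one_cycle(operations, pipeline_regs, pool):
    """One token-tracker invocation per clock cycle (cf. Algorithm 1).
    Token creations and removals are committed at the end of the cycle."""
    scheduled = {}
    for op in operations:
        if not op.is_activated():
            continue                      # inactive units mask their tokens
        e_in = np.array([pool.probability(r) for r in op.inputs])
        if e_in.any():
            e_out = np.clip(op.pemm @ e_in, 0.0, 1.0)
            for res, e in zip(op.outputs, e_out):
                scheduled[res] = e        # token created by active operation
    for reg in pipeline_regs:             # pipeline shift: forward tokens,
        nxt = reg.next_stage_register     # unless a freshly created one wins
        if pool.has_token(reg) and nxt not in scheduled:
            scheduled[nxt] = pool.probability(reg)
    for res, e in scheduled.items():      # end-of-cycle commit;
        pool.update(res, e)               # probability 0 removes the token
```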

5.2.5 Results in Error Prediction

Several case studies on an embedded RISC processor from Synopsys Processor Designer [184] demonstrate the proposed approximate error prediction framework. The processor has five pipeline stages with full data bypassing and forwarding functionality. Both the RTL models and the simulators are generated automatically.

5.2.5.1 Error Prediction Report

The token tracking analysis is carried out on an assembly program consisting of arithmetic and memory access instructions. Tokens are created at different hardware resources, and the resulting error prediction reports are documented in Table 5.4. For instance, the first group shows that the created token with ID 1 dynamically expands into 33 tokens in total. Only 8 tokens live until the end of the simulation, while the rest have been removed or overwritten. The token in the processor core is stored into data memory by a memory access instruction. In contrast, the token in the second group results in one error in data memory although only 3 tokens are expanded. In the last group, although 6 tokens are expanded, none has been stored into data memory, so no application-level error is visible. Based on the error prediction reports, the user can easily perform vulnerability analysis for specific hardware resources in the architecture.

Table 5.4 Example of Error Prediction Report

The report also indicates word-level and nibble-level error probabilities. The two sets of error probabilities differ in their absolute values, since they are calculated using separate word-level and nibble-level PeMMs. It is noteworthy that errors expand into adjacent nibbles with reduced error probabilities, due to the inter-nibble masking effects of arithmetic operations.

Algorithm 1: Token tracker invoked at each clock cycle (see Sect. 5.2.4.2)

5.2.5.2 Accuracy and Speed-Up

The predicted error probabilities are benchmarked against Verilog-based fault injection [44], where faults can be injected into physical resources such as RTL signals, pipeline registers, the register file and memory arrays in the Verilog description.

Accuracy Comparison for Different PeMM Modes In this experiment, a testbench processes data in a loop using general purpose registers and stores the final result into memory. The error prediction results for different modes of PeMM construction are benchmarked against Verilog fault injection. For each fault injection experiment, random inputs are generated with a single bit-flip error at a random bit position; 1,000 experiments are performed to calculate the average error probabilities on the selected hardware resources. In contrast, the proposed PeMM-based analysis is performed in a single run to generate the predicted error probabilities under the same input error.

Figure 5.12 shows the prediction results for the different PeMM modes on selected hardware resources, which include the registers R[1] to R[15] and the final program output value written to data memory. The PeMM without matrix decomposition achieves the lowest accuracy compared with fault injection, whereas PeMM decomposition and the use of assistant signals for dynamic control flow prediction increase the accuracy significantly. Assistant signals covering all related logic blocks allow the PeMM to match the fault injection results perfectly.

Fig. 5.12 Error prediction accuracy for different PeMM modes against Verilog-based fault injection [207]. Copyright ©2015 IEEE

Error Prediction for Embedded Applications The accuracy and timing advantages of the proposed framework are demonstrated against fault injection using several embedded benchmarks. One token and one bit-flip error are created/injected at the same resource location and the same time instance. Table 5.5 shows the word-level error probabilities on selected hardware resources at the end of each application. The PeMM mode is configured with both matrix decomposition and assistant signals.

The error probabilities obtained by fault injection approach the analytically predicted values as the number of experiments grows. The large number of trials during the PeMM characterization phase contributes to the prediction accuracy. Table 5.5 also compares the required time of PeMM prediction and fault injection: token tracking achieves a 25,000x speed-up on average compared to fault injection with 2,000 experiments.

Table 5.5 Accuracy and speed of prediction for embedded benchmarks [207]

5.2.5.3 Timing Overhead for Token Tracking

Overhead of the Preparatory Stage The preparatory stage, which consists of the parsing and characterization phases, automatically generates PeMMs for the 42 operations of the targeted processor. Table 5.6 presents the timing of the preparatory stage on a host machine with an Intel Core i7 CPU at 2.8 GHz. 100,000 characterization runs are performed for each PeMM element. The characterization phase consumes more computational effort than parsing due to its huge number of random experiments. Analysis of the advanced PeMM modes consumes extra time in both phases.

Table 5.6 Processing time for automated PeMM preparation [207]

Timing Overhead Versus Number of Tokens Table 5.7 shows the timing overhead of token tracking relative to the original instruction set simulation. The token tracker with no token injected adds 28.4% overhead on average, due to the search for tokens in every clock cycle. A single injected token adds a further 6.7% simulation overhead, and 20 tokens add 79.3% overhead on average. During simulation, most tokens have a life span of only a few cycles; therefore the overhead does not scale linearly with the number of tokens. The tokens are managed in an unordered hash map with O(1) access complexity, which accelerates the search [40].

Table 5.7 Timing overhead analysis against architecture simulator [207]

Timing Overhead for Different Modes of PeMM The timing overhead of the different PeMM modes is presented in Fig. 5.13, which shows that the run-time effort of the enhanced analysis also incurs larger overhead for all benchmarks.

Fig. 5.13 Run-time for different PeMM modes [207]. Copyright ©2015 IEEE

5.2.5.4 Application-level Error Locations

Another advantage of error prediction is its ability to predict error locations in a huge memory space, which is difficult to achieve with fault injection. This feature helps the designer predict how architectural errors affect application results. The median filter [85] is used to demonstrate the application-level usage of the PeMM flow. Figure 5.14 shows the input and output images. Two tokens are injected at the memory locations storing selected pixels of the input image, and the correspondingly affected regions are predicted in the output image. The prediction matches the algorithmic specification, where the value of each pixel in the output image is computed from the pixel at the same position and the surrounding 8 pixels in the input image.

Fig. 5.14 Error prediction for median filter application [207]. Copyright ©2015 IEEE

5.2.6 Summary

In this work, the probabilistic error masking matrix (PeMM) is proposed to analyze the error masking effects of logic circuits. Built on the PeMM algebra, an approximate error prediction framework is developed to track the paths of error propagation and the error probabilities. The proposed framework achieves high prediction accuracy and significant speed-up compared with a state-of-the-art RTL fault injection technique.

5.3 Reliability Estimation Using Design Diversity

Redundancy is a key feature among fault tolerance techniques [102], as it improves the data integrity of a system. Mathematically speaking, data integrity is the probability that a system either produces a correct result or a detectable error. Hardware redundancy executes logic operations repeatedly on several hardware copies to verify correctness; selected works on such modular redundancy include Redundant Multi-Threading (RMT) [137, 157]. In parallel, software-based redundancy re-executes instructions when idle instruction slots are available [155].

Redundancy is constructed based on duplication, where two or more modules perform the same operation and the results are evaluated through comparison. One metric for evaluating a redundant system is its resilience against common-mode failures (CMFs), where different copies in the system are subjected to the same type of error [118].

Although fault injection [90] and analytical techniques [21] can be applied to estimate reliability, such approaches do not quantify the effects of CMFs on a redundant system. To address this, design diversity has been proposed in [12] to protect circuit-level designs from CMFs. In [106, 156] design diversity assists the development of robust systems. Mitra et al. [133] formally adopted it as a quantifiable evaluation metric for duplicated systems. Previous works mainly apply design diversity to circuit-level designs.

Contribution This work extends the usage of design diversity from the circuit level to the architecture level through a novel graph-based analysis flow based on operation exclusiveness. The proposed approach is used to quantify the design diversity of various classes of architectures. The reliability of applications running on different architectures is quantified through the system Mean-Time-To-Failure, which is closely related to design diversity.

5.3.1 Design Diversity

A duplex system, shown in Fig. 5.15, consists of two modules performing the same functionality; both outputs are verified by a comparator to detect errors. Design diversity refers to the fact that different module implementations can produce different, and therefore detectable, outputs when facing CMFs. Figure 5.15 also shows a multiplex system consisting of more than two modules.

Fig. 5.15 Duplex and multiplex redundant systems [208]. Copyright ©2015 IEEE

Assume a pair of faults \((f_i, f_j)\) is injected into the two modules respectively. The design diversity \(d_{i, j}\) of the fault pair \((f_i, f_j)\) is defined in Eq. 5.3, where n is the total number of input bits, \(2^n\) is the number of all input combinations, and \(k_{i, j}\) is the joint detectability, the number of input combinations producing undetectable errors.

$$\begin{aligned} d_{i, j} = 1 - \frac{k_{i, j}}{2^n} \end{aligned}$$
(5.3)

The design diversity of the system is defined by Eq. 5.4 as the expected value of the design diversity over all possible fault pairs, where \(d_{i, j}\) is the design diversity of fault pair \((f_i, f_j)\) and \(p(f_i, f_j)\) is the probability of that fault pair. Since the system design diversity represents the probability that the system is error free or produces detectable errors, Eq. 5.5 gives the system error probability by simply subtracting the design diversity from one.

$$\begin{aligned} D = \sum _{(f_i, f_j)} p(f_i, f_j)d_{i, j} \end{aligned}$$
(5.4)
$$\begin{aligned} E = 1 - D \end{aligned}$$
(5.5)
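For small circuits, \(d_{i,j}\) and D can be computed by exhaustive simulation; a minimal sketch for a duplex system, assuming each (faulty) module is given as a bit-level function of the n input bits:

```python
from itertools import product

def pair_diversity(golden, faulty_i, faulty_j, n):
    """d_ij = 1 - k_ij / 2^n (Eq. 5.3). k_ij counts the input combinations
    producing undetectable errors: both faulty modules agree on an output
    that differs from the golden one, so the comparator raises no flag."""
    k = sum(1 for bits in product((0, 1), repeat=n)
            if faulty_i(bits) == faulty_j(bits) != golden(bits))
    return 1.0 - k / 2 ** n

def system_diversity(fault_pairs):
    """D = sum of p(f_i, f_j) * d_ij over all fault pairs (Eq. 5.4);
    the system error probability is then E = 1 - D (Eq. 5.5).
    `fault_pairs` is an iterable of (probability, diversity) tuples."""
    return sum(p * d for p, d in fault_pairs)
```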

Similarly, for a multiplex system the design diversity is computed over fault sets \((f_i, f_j,\ldots, f_k)\) with diversity \(d_{i, j,\ldots, k}\), as shown in Eq. 5.6.

$$\begin{aligned} D = \sum _{(f_i, f_j,\ldots , f_k)} p(f_i, f_j,\ldots , f_k)d_{i, j,\ldots , k} \end{aligned}$$
(5.6)

Design diversity can be calculated through exhaustive fault injection simulation; a technique to estimate design diversity efficiently is proposed in [134]. In [133], diversity-based design achieves significantly higher reliability against CMFs.

Fig. 5.16 Implementations of Full Adder (FA) and Full Subtractor (FS) [208]. Copyright ©2015 IEEE

An example of design diversity is presented in Fig. 5.16, where two implementations each of a 1-bit full adder and a full subtractor are shown. The design diversity calculated under the worst-case condition [133] is presented in Table 5.8. The results show that a duplex system with different implementations achieves better design diversity, which corresponds to higher reliability.

Table 5.8 Design diversity for different implementations in Fig. 5.16 [208]

5.3.2 Graph-Based Diversity Analysis

Previously, the design diversity metric was adopted to analyze circuit-level redundancy techniques. In this work, it is applied to architectural analysis using a combined approach of graph-based analysis and circuit-level design diversity. The analysis of major computational building blocks such as RISC and VLIW processors, as well as CGRAs (Coarse-Grained Reconfigurable Architectures), is presented. The combined flow is summarized as follows:

  1. Quantify the conflicting functional units that can be executed simultaneously, through graph-based exclusiveness analysis.

  2. Calculate the circuit-level design diversity of the conflicting functional units with the technique in Sect. 5.3.1.

  3. Use the quantified design diversity to estimate the application-level design diversity.

The proposed analysis flow estimates the maximal design diversity of a specific architecture, which can be used to compare the reliability of different architectures. The graph-based analysis originates from the graph representation of LISA operations, which is introduced in the following.

5.3.2.1 Graph Representation in LISA Language

The LISA 2.0 language [1] has been used to describe various architecture variants such as ASIPs [184], ASICs [200] and CGRAs [151]. The key concept of LISA is the Directed Acyclic Graph (DAG) of operations. A DAG can be represented as a graph \(D=<V, E>\), where V contains operations performing specific functions and E represents the activation or scheduling of child operations by their parents. Figure 5.17 visualizes the DAG for a RISC processor with 5 pipeline stages. In the decode stage, 4 groups of operations are decoded into the EX stage for execution. The DAG also shows the coding fields of specific operations, which are either terminal fields (shown as bits '0' or '1') or non-terminal fields (shown as labels referring to child operations).

Fig. 5.17 Directed acyclic graph with ISA coding for ADL model [208]. Copyright ©2015 IEEE

5.3.2.2 Exclusiveness Analysis

The exclusiveness analysis of operations determines whether operations in the DAG can be executed in the same clock cycle. It was originally proposed in [212] for deciding on resource sharing of mutually exclusive operators. The exclusiveness information can be extracted from the coding and activation conditions in the DAG into another graph representation called the conflict graph, shown in Fig. 5.18. An edge between two operations in the conflict graph indicates that they are not mutually exclusive, i.e. they conflict and can be executed in the same cycle. Operations from different pipeline stages are shown in different colors and always conflict with each other; for simplicity, edges between operations from different stages are not shown. For example, operation Arith in Fig. 5.18 conflicts with Decode, Add, Sub, And and Or, but is exclusive with the remaining operations.

Fig. 5.18 Conflict graph for selected operations in Fig. 5.17 [208]. Copyright ©2015 IEEE

5.3.2.3 Diversity Analysis

One key requirement for a redundant system is the simultaneous execution of logic functions on duplicated hardware copies. This information can be acquired from the exclusiveness analysis of the DAG. To incorporate the analysis, a novel graph representation named the Conflict Multiplex Graph (CMG) is proposed, which carries the following information:

Theorem 5.3.1

Exclusiveness is indicated by colors, where the operators with the same color are mutually exclusive.

Theorem 5.3.2

Functionality is indicated by edges, where a solid edge between operations indicates identical implementations and a dashed edge indicates diverse implementations.

Figure 5.19 presents the CMG together with the DAG for the EX stage of the RISC processor, which consists of 7 operations. This work mainly focuses on the arithmetic and logical operations, which exist in all architecture variants. Compared to Fig. 5.17, 2 additional operations are decoded from the coding field Chk, which is intended to check both the Arith and Logic operations. The operations decoded from Chk and All_insn conflict with each other, since they come from different coding fields in Decode. Hence, MAC and And2 are shown in different colors from the rest. Regarding functionality, MAC can achieve the same functionality as Add, Mul and Sub with diverse implementations, so they are connected by dashed edges. And2 can only duplicate And1 with an identical implementation, indicated by a solid edge between them. No further edges exist in the CMG, since the other operations either cannot repeat any functionality or are mutually exclusive.

Fig. 5.19 Directed acyclic graph and conflict multiplex graph [208]. Copyright ©2015 IEEE

The CMG-based analysis helps quantify the duplex/multiplex pairs for a specific operation, while the design diversity of each pair is calculated with the circuit-level simulation technique of Sect. 5.3.1. The calculated diversity for selected pairs of logic functions is listed in Table 5.9.

Table 5.9 Duplex pairs for EX pipeline stage in Fig. 5.19 [208]

5.3.2.4 CMG for Several Architecture Variants

In this section, the CMGs of several architectures are presented to identify the redundancy available at the architecture level. It is worth noting that this analysis detects the theoretical maximal redundancy; further software or compilation techniques must be designed to actually utilize it, which is not covered in this work. The CMG-based analysis provides an analytical methodology to benchmark design diversity across different architectures.

TMR Triple Modular Redundancy (TMR) is a widely used technique which exploits three logic units to verify the correctness of the protected operation. An example CMG of a TMR architecture is shown in Fig. 5.20, where the Add and Sub operations are protected by two extra copies each. Add2 and Add3 are identical, while Add1 is diversely implemented; Sub1, Sub2 and Sub3 are all identical. The diversity of such multiplex pairs is calculated by Eq. 5.6. It is worth mentioning that Add2 and Add3 conflict with all operators in blue. However, a regular TMR implementation, which groups all three addition operations together without access to other units, may prevent Add2 and Add3 from forming pairs with operations other than Add1.

Fig. 5.20 Conflict multiplex graph for TMR architecture [208]. Copyright ©2015 IEEE

URISC URISC [150] proposes a fault tolerance technique based on the Turing-complete instruction subleq, which executes on a co-processor to diversely duplicate the instructions of the main processor. The approach is shown abstractly in Fig. 5.21. Since subleq is decoded separately in the coprocessor and can perform the functionality of any operation, it forms a diverse pair with all operations in the main core.

Fig. 5.21 Conflict multiplex graph for URISC architecture [208]. Copyright ©2015 IEEE

VLIW A VLIW processor issues multiple instruction syllables, which are decoded separately, for parallel execution. The CMG for a VLIW with four syllables is presented in Fig. 5.22. Each operation in one syllable conflicts with all operations from the other syllables, so they can form a multiplex system. For example, Sub1 can form identical duplex pairs with Sub2, Sub3 and Sub4, as well as diverse pairs with Add2, Add3 and Add4.

Fig. 5.22 Conflict multiplex graph for VLIW architecture [208]. Copyright ©2015 IEEE

CGRA A CGRA consists of a large number of processing tiles interconnected through a specific network topology. Several prefabricated functional units (FUs) exist in each processing tile, and their functionalities are selected during the post-fabrication configuration phase. The difference between a CGRA and an FPGA is that the FPGA uses look-up-table FUs, which allow finer-grained designs than a CGRA. In each configuration, only one function is realized per tile, so the FUs inside one tile are mutually exclusive; however, the configuration does not constrain the FU functionality across tiles. Figure 5.23 shows the CMG of a CGRA with six tiles, where a large number of identical and diverse pairs are indicated.

Fig. 5.23 Conflict multiplex graph for CGRA architecture [208]. Copyright ©2015 IEEE

5.3.3 Results in Diversity Estimation

This section presents several case studies of design diversity based reliability analysis. First, application-level design diversity is estimated from the architecture-level design diversity and instruction statistics. After that, the system-level Mean-Time-To-Failure (MTTF) is derived from the design diversity.

5.3.3.1 Architecture Diversity Evaluation

Three architecture variants, RISC, VLIW and CGRA, are analyzed. Four exemplary operations, Add, Sub, Sll and Srl, are chosen for the calculation of design diversity. Table 5.10 lists the number of pairs for both the identical and the diverse systems. The identical system consists of a single type of module for each operation, while the diverse system has an equal number of modules of two types. Design diversity is evaluated according to Eq. 5.6, where all modules of the same operation are used to verify the correctness of that operation.

Table 5.10 Architecture variants of design diversity evaluation [208]

Figure 5.24 shows the estimated architecture-level design diversity, with similar trends among all architectures. More modules in the system always achieve higher design diversity. With the same number of modules, diverse implementations lead to better design diversity than identical ones. Quantitatively, the RISC architecture with two diverse modules has a design diversity comparable to a VLIW with four identical modules.

Fig. 5.24 Design diversity of architecture variants [208]. Copyright ©2015 IEEE

5.3.3.2 Application-level Diversity Evaluation

Taking advantage of the architecture-level analysis, application-level design diversity is introduced in Eq. 5.7. While \(D_{op}\) directly refers to the architectural design diversity of operation op, \(P_{op, app}\) is the percentage of operation op among all operations of application app. An assembly-level instruction profiler can determine \(P_{op, app}\) for any high-level application. To increase the application-level design diversity and reduce the error probability, it is desirable to execute the operations with higher percentages on more diverse modules.

$$\begin{aligned} D_{app} = \sum _{op} P_{op, app} D_{op} \end{aligned}$$
(5.7)

The PD_RISC processor from the IP library of Synopsys Processor Designer [184] is used to evaluate the design diversity of several embedded applications. The cycle-accurate instruction-set simulator generates the instruction profiling statistics.

Fig. 5.25 Application-level design diversity for PD_RISC processor [208]. Copyright ©2015 IEEE

Figure 5.25 presents the evaluation of application-level design diversity on the PD_RISC processor, with Add, Sub, Sll and Srl as target operations. For all applications, diverse systems result in higher design diversity than identical ones. The differences in absolute values are caused by the different operation percentages of the applications.

5.3.3.3 Mean-Time-To-Failure Estimation

The \(MTTF_{op}^{arch}\) of a specific operation op on architecture arch can be estimated from the failure rate \(\lambda_{op}^{arch}\), as introduced in Eq. 5.8. For a transient bit-flip fault model, Eq. 5.9 derives \(\lambda_{op}^{arch}\) from \(P_{op}^{1 fault, arch}\), the probability that one fault is injected into every module of the multiplex system for operator op on architecture arch, and from the operator error probability \(E_{op}^{arch}\), which equals \(1-D_{op}^{arch}\) as in Eq. 5.5, where \(D_{op}^{arch}\) is the design diversity of that multiplex system. In Eq. 5.10, \(P_{op}^{1 fault, arch}\) is expressed as the architecture-dependent product of the module-level fault probabilities \(P_{op, i}^{1 fault}\), each of which corresponds to the area estimate \(A_{op, i}\) of the operator divided by the constant \(A_{1 fault/hour}\). \(A_{1 fault/hour}\) is the size of the area within which one fault per hour is injected under a specific environmental condition; it is obtained as the reciprocal of the Failure-in-Time (FIT) rate [71] in Eq. 5.11. For instance, this work assumes a FIT of \(10^{-4}\ cph/\mu m^{2}\), in units of fault counts per hour (cph) per unit area (\(\mu m^{2}\)).

$$\begin{aligned} MTTF_{op}^{arch} = \frac{1}{\lambda _{op}^{arch}} \end{aligned}$$
(5.8)
$$\begin{aligned} \lambda _{op}^{arch} = P_{op}^{1 fault, arch} E_{op}^{arch} = P_{op}^{1 fault, arch} (1-D_{op}^{arch}) \end{aligned}$$
(5.9)
$$\begin{aligned} P_{op}^{1 fault, arch} = \prod _{i}^{arch} P_{op, i}^{1 fault} = \prod _{i}^{arch} (A_{op, i}/A_{1 fault/hour}) \end{aligned}$$
(5.10)
$$\begin{aligned} A_{1 fault/hour} = \frac{1}{FIT} \end{aligned}$$
(5.11)
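A small sketch of Eqs. 5.8-5.11 follows; the area and diversity values are made up for illustration and are not those of Table 5.11 or Fig. 5.24.

```python
FIT = 1e-4                    # assumed fault rate: faults per hour per um^2
A_1FAULT_PER_HOUR = 1 / FIT   # Eq. 5.11: area seeing one fault per hour

def mttf(areas_um2, diversity):
    """Eqs. 5.8-5.10 for one operator on one architecture.
    areas_um2 : area A_op,i of each module in the multiplex system
    diversity : design diversity D_op^arch of the system"""
    p_1fault = 1.0
    for a in areas_um2:                 # Eq. 5.10: product over all modules
        p_1fault *= a / A_1FAULT_PER_HOUR
    lam = p_1fault * (1.0 - diversity)  # Eq. 5.9: failure rate
    return float("inf") if lam == 0 else 1.0 / lam  # Eq. 5.8, in hours

# Duplex adder example: two 500 um^2 modules with D = 0.999
print(mttf([500.0, 500.0], 0.999))  # ~4e5 hours
```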

Table 5.11 presents the estimated \(A_{op, i}\) and \(P_{op, i}^{1 fault}\) for the four operators, based on the 90nm Faraday technology cell library [61].

Table 5.11 Failure rate estimation for four operators [208]

Using the \(D_{op}\) values from Fig. 5.24, the estimated \(MTTF_{op}^{arch}\) for the four operators on the architecture variants is presented on a logarithmic scale in Fig. 5.26. It is observed that the CGRA architecture is naturally more robust than the VLIW, which is in turn more reliable than the RISC architecture.

The MTTF increases both with the number of modules in the system and with the design diversity for the same operation. Compared with Fig. 5.24, the RISC architecture with two diverse modules has a significantly lower MTTF than the VLIW with four identical modules. This is because \(P_{op}^{1 fault}\) is much smaller for the VLIW than for the RISC: more modules in the multiplex system imply a lower probability that a fault of the same type occurs in every module. The CGRA shows similar trends to the VLIW. Among the four operators, Sll shows the highest MTTF, resulting from its smallest size and relatively high design diversity.

Fig. 5.26 Mean-time-to-failure of architecture variants [208]. Copyright ©2015 IEEE

5.3.4 Summary

In this work, the design diversity metric, originally proposed to quantify the reliability of circuit-level designs, is extended to the architecture-level analysis of different processing architectures. This is achieved through a novel graph-based analysis of the functionality and exclusiveness of the operations in an architecture. The proposed approach is applied to architecture- and application-level design diversity estimation, as well as to system Mean-Time-To-Failure estimation.