Towards Tolerating Soft Errors for Embedded Systems

Abstract

Exponential growth in the number of transistors per chip, together with increasing clock frequencies and decreasing operating voltages and load capacitance, is aggravating the likelihood of soft errors in embedded systems. Transistors on current chips have components separated by only a few hundred atoms; hence, a small voltage glitch can alter the state of a transistor and thus cause soft errors in the system. The impact will become a matter of greater concern as line widths shrink further. The intricate linkages among the components of such chips directly affect the reliability of embedded systems and make them sensitive to soft errors. The common approach to addressing such errors focuses on post-design phases, which are complex and costly to implement. However, reliability, a vital non-functional attribute of a system, should be validated at the design phase, particularly for critical systems. This paper proposes an efficient approach to measure and minimize the potential threats of soft errors for embedded systems at the early, system-level design phase. The methodology is validated against a system model that must have high reliability.

Introduction

New technologies, coupled with design constraints and the growth of non-functional requirements, are making the design of embedded systems and the Internet of Things (IoT) more complex [1, 2]. Size and weight have become two vital constraints for embedded systems since applications have extended to handheld devices. On average, every two years a new process technology is announced, offering reduced transistor dimensions, operating voltages, and design margins. Line widths of 32 nm have already been attained, and improvement towards 18 nm is expected within a few years [3]. Chips currently contain transistors whose components are merely a few hundred atoms wide. Therefore, very little energy is needed to change the state of a transistor, causing transient faults. These transient faults lead to soft errors in embedded systems employing these chips. This impact will become a serious concern as line widths decrease further. On one hand, technological advancements require choosing up-to-date processors to support the multifaceted functionalities of systems; on the other hand, shrinking processor geometries coupled with reduced voltage levels decrease the natural resilience of chips to soft errors. The occurrence of soft errors in embedded systems and the IoT is a serious concern because these systems require high reliability and have strict real-time requirements. Besides, in an embedded system, software and hardware are so intricately linked that the system becomes sensitive to soft errors. Moreover, such systems become more susceptible to soft errors in electronically hostile environments. Around 2000, Sun Microsystems acknowledged that cosmic rays (one source of soft errors) affected cache memories, which in turn crashed server systems [4]. The faults that occurred in the control software of the Ariane-5 launcher and the Mars Pathfinder are two examples of the disastrous consequences of soft errors [5].

Prior research addressing the occurrence of soft errors in embedded systems has focused primarily on post-design phases. However, hardware and software duplication introduce the problems of synchronizing identical threads and requiring supplementary hardware.

Many approaches in the literature address the design of resilient and fault-tolerant embedded systems such as FPGAs. Although traditional techniques like error correcting codes (ECC) and Triple Modular Redundancy (TMR) can be applied to improve the reliability of these systems, they introduce additional computational and power overhead and may not be suitable for high-performance systems [6]. These techniques have either been extended to remove these limitations or applied to specific modules or functionality of the FPGAs. For example, [7] applies partial TMR to reduce Single Event Upsets (SEUs) in FPGAs by focusing on particular (persistent and non-persistent) bits in the configuration memory. The approach in [8] focuses on identifying critical nodes in FPGA designs and applies redundancy techniques to such nodes by modifying the code. While the resultant design is not altered functionally, it becomes more hardened to soft errors and the failure rate is reduced [8]. In [9], the authors designed a compact but reliable PUF (physically unclonable function) ID generator circuit for FPGAs. Although the proposed circuit ensures reliability and robustness under different environmental conditions, it targets a specific hard-coded functionality of an FPGA. In addition to these techniques, some researchers attempt to build reliability checkers and quantifiers into the design tools. For example, [10] proposes to combine probabilistic reliability models into the FPGA design suite at the logic layer. The tool outputs the error probability and the reliability of the design (circuit) to help FPGA designers improve their designs.

Reliability, which is a vital non-functional attribute of a system, should be validated at the design phase, particularly for critical systems [11]. However, testing conclusively across the entire design space is clearly complex. Hence, the focus should be on identifying those components in the system model where the occurrence of soft errors would be most damaging to the system. From this perspective, two questions need to be answered.

  • If a soft error were to occur at a given point (component) in the system, what would be the impact on system functionality? This is referred to as component criticality in the rest of the paper.

  • If the impact is severe, how could this impact be lowered to minimize the risk of functional degradation of the system?

These questions are addressed in this paper, which advances the work done in our previous paper [12]. The rest of the paper is organized as follows. “Related Work” reviews the related work in the proposed area of research. The proposed methodology to analyze the components’ criticality and to lower these criticalities is described in “Analyzing Criticality of the Components”. “Case Study” illustrates the methodology using a case study, and “Conclusions” concludes the paper.

Related Work

Recognizing the risks and threats posed by soft errors in computer systems, several measures have evolved to tolerate soft errors at different levels. Some of these approaches are outlined below.

At the process technology level, solutions tend to reduce the soft error rate (SER) by removing the radiation source or contaminated material from the process flow [13,14,15,16,17]. To reduce thermal neutron-induced soft errors, borophosphosilicate glass (BPSG) should be eliminated from the process flow [13]. However, integrating these types of solutions into chip processes increases the manufacturing cost and lowers the yield compared to the bulk process. Silicon-On-Insulator (SOI) technology [15, 16] mitigates the effect of high-energy neutrons originating from cosmic rays by building devices in a very thin silicon layer on top of a buried oxide in IC chips. Compared to bulk CMOS counterparts, charge collection in SOI technology is limited to the shallow depth of the silicon film; hence, SOI devices collect less charge from an alpha or neutron particle strike [17]. However, the soft error rate may increase due to lower operating voltage, reduced junction capacitance, and amplification by parasitic bipolar transistors. To lower this increased sensitivity, commercial microprocessors with PowerPC architecture use partially depleted SOI processes. The majority of process solutions fail to reduce the SER by more than a factor of five, which does not justify the additional expense of process complexity, yield loss, and substrate cost.

At the software level, fault detection is generally achieved by adding instruction and information redundancy [3, 18, 19]. However, this incurs additional memory cost (for the extra data and instructions) and performance overhead (for the replicated computations and the consistency checks). Error correcting codes (ECC) [3] add extra bits to the original bit sequence so that errors can be detected: the extra bits are added at the sending end, and any changed bits can then be detected at the receiving end. However, applying ECC to combinational logic blocks needs extra logic and calculations.

At the hardware level, approaches to tolerating soft errors mostly emphasize circuit-level, logic-level, and architectural solutions [20,21,22]. In circuit-level solutions, two approaches are generally used to lower the threat of soft errors: the first is to increase the critical charge (Qcrit) of a circuit node, and the second is to add redundant transistors to enable redundant storage of information. Critical charge (Qcrit) is generally improved using gate sizing techniques, enhanced capacitance, and resistive hardening. However, these methods incur power overhead and slow down operations. Resistive hardening adds passive poly-silicon intra-cell decoupling resistors in the cross-coupling segments of each SRAM cell [20]. To ensure soft error tolerance over the whole operating temperature range, the minimum intra-cell resistance must be derived at the maximum temperature of interest. Liu et al. [21] and Calin et al. [22] used redundant transistors to mitigate the soft error problem; such circuits can detect soft errors and restore the data if required. However, these redundant transistor-based circuits increase the area and result in excessive power usage. Moreover, the issues of bandwidth requirements and latency during inter-processor communication are overlooked in these types of solutions.

Combined hardware and software approaches use multiprocessor chips and redundant multi-threading to tolerate soft errors [23,24,25]. However, these post-design-phase approaches incur additional cost and complexity without contributing any remarkable performance advancement.

Criticality analysis at the sub-system level, along with Failure Mode and Effects Analysis (FMEA), is also becoming popular in fault-tolerance research. FMEA is an analysis of potential failure modes within a system, classified by severity or by the effect of the failures on the system. The Risk Priority Number (RPN) [26], the MIL-STD-1629A criticality number ranking [27], and multi-criteria Pareto ranking [28] are common methods for assessing criticality in FMEA. In RPN, the risk number is a function of the occurrence ranking, severity ranking, and detection ranking; the failure mode with the highest RPN is not necessarily the one with the highest severity, since a less severe failure may occur more often or be less detectable. In MIL-STD-1629A [27], the criticality number is calculated as the product of the failure mode ratio, the probability of the failure’s impact, the part failure rate, and the considered duration. Pareto ranking [28] is an enhancement of the MIL-STD-1629A criteria in which severity is measured on a ratio scale instead of an ordinal scale. However, difficulties in calculating failure rate values or the probability of failure make MIL-STD-1629A and Pareto ranking unpopular with researchers.

In our previous approach [12], the specified metrics may not be sufficient to capture the complexity of a system. For example, a component with low static complexity may perform thousands of simple (with respect to system functionality) iterations, whereas a component with only a few lines of instructions, requiring far less execution time, may be more complex than other components. Hence, more metrics are needed to correctly reflect the complexity of both the component and the system.

Analyzing Criticality of the Components

Evaluating the reliability of a large number of components at the design level, and subsequently evaluating the reliability of the system when the individual components interact with each other, is a non-trivial and time-consuming task. A critical issue in dependability theory is to have timely and accurate indicators and metrics to characterize the dependability of the system [29]. The next section illustrates the methodology used to assign metrics to components based on their criticality and complexity.

Complexity Analysis

The more complex a system, the greater the likelihood of it encountering soft errors. Thus, identification and subsequent reduction of a system’s complexity is an essential part of system design. Since a system’s complexity is a direct function of the complexity of the individual components it comprises, an individual component’s complexity must be defined and measured. The complexity of a component depends on many factors, some of which are given below.

Execution Time (ET) of a Component

As the duration of execution of a component increases, the likelihood of it encountering soft errors increases. This follows from the fact that a component’s Failure-In-Time (FIT) is a function of its execution time [30].

Interdependency Among Components

The faulty behavior of a component is a function of its interdependency and interaction with other components. The frequency and volume of messages into and out of a component (Message-In-And-Out, MIO) thus reflect its complexity.

Static Complexity of a Component

As described by Jürjens et al. [31], the static complexity of a component is a function of its design constituents. These are briefly described below. For a more detailed explanation, the reader is referred to [32].

Number of Parts (NOP)

As the number of parts in a component increases, dependency- and synchronization-related complications increase, which may increase the likelihood of soft errors in the component. Thus, the NOP of a component contributes to its complexity.

Number of Required Interfaces (NRI)

The required interfaces of a component characterize its connections with other components. If these interfaces are not properly defined, faults may result. Hence, NRI is another attribute that affects the complexity of a component.

Number of Provided Interfaces (NPI)

NPI contributes to the complexity of a component since it represents the usage of a component by other components in the system.

Cyclomatic Complexity of State Machine (CCS)

State machines are typically used to define the behavior of components in a system. Properties of these state machines can be used as input to the complexity calculation of components [31]. Thus, the cyclomatic complexity of a component i can be calculated as \({\text{CCS}}_i = |T| + |E| + |{\text{AG}}| + 2\), where T is the multi-set of transitions, E is the multi-set of event triggers, and AG is the multi-set of atomic expressions in the guard conditions in the state machine description of the component. Based on the above description, the overall complexity of a component i, \({\text{OCCOM}}_i\), can be represented by (1).

$${\text{OCCOM}}_i = f\left( {{\text{et}}_i ,{\text{ mio}}_i ,{\text{ sc}}_i } \right),$$
(1)

where \({\text{et}}_i\), \({\text{mio}}_i\), and \({\text{sc}}_i\) are the execution time, message-in-and-out frequency, and static complexity of the ith component, respectively.
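A minimal sketch of this calculation is given below. It assumes a simple list-based representation of a component's state machine and treats f in (1) as a plain sum with the static complexity normalized, mirroring Eqs. (3) and (4) used later in the case study; the component data, function names, and numeric values are illustrative and are not part of the paper's Rhapsody model.

```python
# Illustrative sketch: computing CCS and overall complexity for a component.
# The state-machine representation and the additive form of f() are assumptions.

def cyclomatic_complexity(transitions, event_triggers, atomic_guards):
    """CCS = |T| + |E| + |AG| + 2, following the definition in the text."""
    return len(transitions) + len(event_triggers) + len(atomic_guards) + 2

def overall_complexity(et, mio, sc, norm=100.0):
    """OCCOM = f(et, mio, sc); here f is taken to be a simple sum with SC
    normalized, echoing Eqs. (3) and (4) of the case study."""
    return et + mio + sc / norm

# Hypothetical component description (not taken from the paper's model):
transitions = ["Idle->Dialing", "Dialing->Connected", "Connected->Idle"]
event_triggers = ["evDial", "evConnect", "evHangUp"]
atomic_guards = ["signal_ok", "line_free"]

ccs = cyclomatic_complexity(transitions, event_triggers, atomic_guards)
print("CCS =", ccs)                                    # 3 + 3 + 2 + 2 = 10
print("OCCOM =", overall_complexity(et=0.29, mio=0.12, sc=ccs + 5))
```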

Measuring the Severity of Failure of the Components

A solitary bit upset in a single component might be more severe than multi-bit upsets in several components. Hence, these impacts need to be evaluated for each individual component. A component’s complexity and severity are combined to better estimate the impact on system functionality when a bit upset arises in that component. In this paper, the severity of failure of the components and messages is determined by the Failure Mode and Effects Analysis (FMEA) [28] method. FMEA analyzes the probable failure modes of a component with respect to their impact on the functionality of the system. It also ranks component failures by evaluating their impact. To quantify severity, a ten-point scale suggested by Hosseini et al. [33] has been used in this paper. In their paper, the severity ‘Hazardous’ is ranked with the highest scale value 10; the severity ‘Serious’ is ranked with the second-highest scale value 9; and so on down to the severity ‘Very minor’, which is ranked with the second-lowest scale value 2, and the severity ‘No effect’, which is ranked with the lowest scale value 1.

To apply FMEA on a model and to design appropriate faults for injection that sufficiently cover all fault-space cases, domain expertise is required. For example, to inject an appropriate ‘bit flip’ type of fault into the model, it is important to know and differentiate between variables and value ranges. More specifically, for a statement x = 7, both the variable (x) and the value (7) could potentially be altered by a single bit-flip fault. While 7 (0000 0111 in binary) could be changed to 3 (0000 0011 in binary) or 5 (0000 0101 in binary), the variable x (0111 1000 in binary) could be changed to y (0111 1001 in binary) or to h (0110 1000 in binary). Thus, designing appropriate test cases that cover all equivalence classes is important.
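As an illustration of the bit-flip fault model described above, the following sketch flips a chosen bit of an integer value or of a single ASCII character. The helper names are hypothetical and are not part of the paper's fault injector; they simply reproduce the 7-to-3/5 and x-to-y/h examples.

```python
# Illustrative single-bit-flip fault injector (helper names are hypothetical).

def flip_bit(value: int, bit: int) -> int:
    """Flip one bit of an integer value, modeling a single-event upset."""
    return value ^ (1 << bit)

def flip_char_bit(ch: str, bit: int) -> str:
    """Flip one bit of a character's ASCII code."""
    return chr(flip_bit(ord(ch), bit))

# The value 7 (0000 0111) becomes 3 or 5 depending on which bit flips:
print(flip_bit(7, 2))          # 3  (0000 0011)
print(flip_bit(7, 1))          # 5  (0000 0101)

# The identifier 'x' (0111 1000) becomes 'y' or 'h':
print(flip_char_bit('x', 0))   # 'y' (0111 1001)
print(flip_char_bit('x', 4))   # 'h' (0110 1000)
```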

Measuring the Propagation of Failure from the Components

When some components fail, they cause failures (ripples) to propagate or spread across other components in the system, causing further faults. On the other hand, some components can contain a fault that arises within them and restrict its spread beyond the component. Thus, measuring the ability of a component to contain or propagate a failure is important, as it affects the criticality of the whole system. Figure 1 illustrates the method of measuring the propagation of failure. It shows a typical example of a system containing three components: COM1, COM2, and COM3. NET denotes the network environment communicating with the system. The products of complexity and severity of these three components are PSev1, PSev2, and PSev3, respectively. In Fig. 1, Sev1…Sev10 denote the severities of the messages, indexed according to their order of occurrence in the system. In this paper, the propagation of failure from or in the environment is not shown. Any error inside COM1 raises the consequence in COM2 to PSev1Sev2, as it could propagate from COM1 to COM2.

Fig. 1 A typical example of a system to measure the propagation of failure

After passing the 2nd message, the increased consequence in COM2 is PSev1Sev2, and after passing the 3rd message, the increased consequence in COM3 is PSev1Sev2s2Sev3. Similarly, after passing the 9th message, the increase in the level of consequences in COM1 is PSev1Sev2s2Sev3s3Sev4s2Sev5s2(Sev6 + Sev7)PSev1Sev8s1Sev9. Thus, the entire consequence in the system can be measured as:

$${\text{CONCOM}}_{1} = {\text{PSev}}_{1} {\text{Sev}}_{2} {\text{s}}_{2} {\text{Sev}}_{3} {\text{s}}_{3} {\text{Sev}}_{4} {\text{s}}_{2} {\text{Sev}}_{5} {\text{s}}_{2} \left( {{\text{Sev}}_{6} + {\text{Sev}}_{7} } \right){\text{PSev}}_{1} {\text{Sev}}_{8} {\text{s}}_{1} {\text{Sev}}_{9} .$$

Similarly, other consequences in the system can be measured. For detailed calculation, the readers can see our previous paper [12].
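A rough sketch of how such a consequence could be accumulated along a chain of messages is shown below. It assumes, as the expressions above suggest, that the consequence grows multiplicatively with each message severity passed; the function name and numeric values are illustrative only, and the exact accumulation rule is detailed in [12].

```python
# Illustrative accumulation of failure consequences along a message chain.
# The multiplicative accumulation mirrors the expressions above; values are made up.

def propagate(psev_start, message_severities):
    """Accumulate the consequence of a failure starting at a component whose
    complexity-severity product is psev_start and passing through messages
    with the given (normalized) severities."""
    consequence = psev_start
    for sev in message_severities:
        consequence *= sev
    return consequence

# Hypothetical values: PSev1 for COM1 and normalized severities of messages 2..4.
psev_com1 = 0.9
print(propagate(psev_com1, [0.8, 0.7, 0.6]))  # consequence after the 4th message
```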

Measuring Criticality of a Component

The criticality of the ith component, \({\text{Cr}}_i\), can be derived by (2).

$${\mathrm{Cr}}_{i}=\left({w}_{1}\times {\mathrm{OCCOM}}_{i}+{w}_{2}\times \mathrm{CON}\left({\mathrm{COM}}_{i}\right)+{w}_{3}\times \mathrm{Se}\left({\mathrm{COM}}_{i }\right)\right),$$
(2)

where \({\mathrm{OCCOM}}_{i}\) is the component’s complexity, \(\mathrm{CON}\left({\mathrm{COM}}_{i}\right)\) is the component’s propagation of failure, \(\mathrm{Se}\left({\mathrm{COM}}_{i}\right)\) is the component’s severity of failure, and \({w}_{1}\), \({w}_{2}\), \({w}_{3}\) are the weights for the component’s complexity, propagation of failure, and severity of failure, respectively. In this research, the values of \({w}_{1}\), \({w}_{2}\), \({w}_{3}\) are assumed to be 0.5, 0.8, and 0.9, respectively, based on qualitative analysis (manual analysis of their impact on the system). The methodology of criticality analysis is shown in Fig. 2. Once the criticalities of individual components are calculated, they are ranked to obtain the criticality ranking of components. The result is then analyzed to flag the components that, if affected by soft errors, can influence the system the most. This paper then targets these components to lower their criticalities and thereby minimize the threats of soft errors.
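A minimal sketch of (2) and the subsequent ranking step is given below, using the weights stated above; the component names match the case study, but the metric values are placeholders rather than the paper's measured data.

```python
# Illustrative criticality calculation and ranking per Eq. (2).
W1, W2, W3 = 0.5, 0.8, 0.9   # weights for complexity, propagation, severity (from the text)

def criticality(occom, con, se):
    """Cr = w1*OCCOM + w2*CON + w3*Se."""
    return W1 * occom + W2 * con + W3 * se

# Placeholder metric values for three components (not the paper's measured data):
components = {
    "CMC": {"occom": 1.2, "con": 0.9, "se": 0.8},
    "MMC": {"occom": 0.7, "con": 0.4, "se": 0.6},
    "DMC": {"occom": 0.6, "con": 0.3, "se": 0.5},
}

ranking = sorted(
    ((name, criticality(**m)) for name, m in components.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, cr in ranking:
    print(f"{name}: Cr = {cr:.2f}")   # most critical component first
```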

Fig. 2 The methodology of criticality analysis

Lowering the Criticality of a Component

The criticality of a component is an indicator that flags where in the system a design revision is required to reduce the threat of soft errors. The exercise does not guarantee the correction of all errors or their minimization; rather, the most error-prone components in the system are identified where design modification may lower the threat of soft errors. This is an iterative revision process. In each iteration, it is checked whether the impact of soft errors in the system can be lowered while satisfying the other non-functional properties of the system. If any property is not satisfied, other alternatives are subsequently tried until either the goal is achieved or all alternatives are exhausted. The criticality of a component or system can be lowered by lowering any of the following: complexity, severity, or propagation of failure. To better illustrate the methodology of lowering the criticality of components or of the system, the example system model shown in Fig. 1 is used.

If COM1 is the most critical of the three components (based on its complexity, severity, and propagation of failure), then the design is analyzed to find which of the three factors is responsible for the large value. A higher value of complexity suggests that the component probably occupies a longer duration within the whole execution period or has a large number of dependencies on other components. If the severity of COM1 shows a higher value, it means that a soft error in it has more effect on the overall system functionality than one in COM2 or COM3. If the large criticality is due to the propagation of failure from COM1, it implies that COM1 appears as a starting node in most of the communication scenarios. Then, the architecture or behavioral model of the entire system is analyzed to minimize any of the constituents of criticality for the component while keeping all other constraints unaffected or making a trade-off among them.
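The iterative revision loop described above might be sketched as follows. All helper arguments (the design alternatives, the constraint check, and the criticality measurement) are hypothetical placeholders for the manual design analysis performed on the UML model; they are not functions provided by the paper or by Rhapsody.

```python
# Illustrative sketch of the iterative criticality-lowering loop.
# The callables below stand in for manual design analysis on the UML model.

def lower_criticality(component, alternatives, constraints_satisfied, measure_criticality):
    """Try design alternatives for a critical component until its criticality
    drops without violating the other non-functional constraints."""
    baseline = measure_criticality(component)
    for revision in alternatives:
        candidate = revision(component)           # apply one candidate design change
        if not constraints_satisfied(candidate):  # e.g. functionality, timing, power
            continue
        if measure_criticality(candidate) < baseline:
            return candidate                      # accept the improved design
    return component                              # no acceptable alternative found
```

In the paper, the "alternatives" correspond to manual refactorings of the structural and behavioral models (such as merging or detaching states) rather than automated transformations.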

Case Study

To validate the methodology proposed in this paper, it is applied to a smartphone system. This system, which is required to have high reliability, is described in more detail in the next sub-sections. The abbreviations used in this section are shown in Table 1.

Table 1 Table of abbreviations

Smartphone System

A smartphone system comprises multiple hardware components and interconnects (processors, memory, input/output devices, and other circuitry), peripheral hardware, and software modules that run on one or more of these components. For the simulation of the smartphone application, IBM Telelogic Rhapsody, a UML Model Driven Development (MDD) environment for technical, real-time, and embedded systems and software engineering, is used. It is an advanced MDD solution that helps to create precise, easy-to-understand design and test specifications of large, complex systems and their related applications. In fact, the model can describe all aspects of the deployable system, including hardware, data, personnel, procedures, facilities, application servers, and software. The core support of the tool is the generation of code implementations from structural semantics that can be specified by UML diagrams. The smartphone model used in this case study has three main components: the connection management component (CMC), the mobility management component (MMC), and the data link management component (DMC). CMC handles the reception, setup, and transmission of incoming and outgoing call requests; MMC monitors registration; and DMC handles the registration and location of users. There are two actors in the system: the User Interface (UI) and the Network environment (NET). UI represents the handset user interface, including the keypad and display, and places and receives calls. NET represents the system’s network environment or signaling infrastructure; it tracks users, monitors signal strength, and provides network status and location registration. The analysis of the criticality of the system is described next.

Complexity Analysis for the Smartphone System

For measuring ET, a log file is used to store the state transition times of the components of the smartphone. The log file is analyzed when the simulation ends, and the total ET of each component is then measured.
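Assuming the simulation log records state-entry and state-exit events with timestamps per component (the exact Rhapsody log format is not specified in the paper), the per-component ET could be accumulated roughly as follows; the log line format shown is a hypothetical illustration.

```python
# Illustrative ET extraction from a simulation log.
# Assumed (hypothetical) log line format: "<timestamp> <component> ENTER|EXIT <state>"

from collections import defaultdict

def total_execution_time(log_lines):
    et = defaultdict(float)      # component -> accumulated time spent in states
    entered = {}                 # component -> timestamp of the last ENTER event
    for line in log_lines:
        ts, component, event, _state = line.split()
        if event == "ENTER":
            entered[component] = float(ts)
        elif event == "EXIT" and component in entered:
            et[component] += float(ts) - entered.pop(component)
    return dict(et)

log = [
    "0.00 CMC ENTER Dialing",
    "0.12 CMC EXIT Dialing",
    "0.12 MMC ENTER UpdateLocation",
    "0.15 MMC EXIT UpdateLocation",
]
print(total_execution_time(log))   # {'CMC': 0.12, 'MMC': ~0.03}
```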

TMIO is measured by summing the MIOs of all three components in the smartphone system. As discussed in “Interdependency Among Components”, MIO reflects a component’s interdependency and interaction with other components. The values are derived by counting the number of input and output edges of a component.

To derive SC, the CCS of each component is determined first. Then, all SC values are divided by 100 to obtain the normalized values, NSC. Table 2 shows the CCS of the smartphone system. Table 3 shows the total complexity of the individual components, calculated using (3) and (4).

Table 2 Cyclomatic complexity calculation of the Smartphone System
Table 3 Overall complexity (OCCOM) calculation
$$\mathrm{SC}=\mathrm{NOP}+\mathrm{NRI}+\mathrm{NPI}+\mathrm{CCS}.$$
(3)
$$\mathrm{OCCOM}=\mathrm{NSC}+\mathrm{ET}+\mathrm{TMIO}.$$
(4)
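The following worked example of Eqs. (3) and (4) uses hypothetical metric values (the paper's actual measured values are in Tables 2 and 3), simply to make the arithmetic concrete.

```python
# Worked example of Eqs. (3) and (4) with placeholder metric values.
NOP, NRI, NPI, CCS = 4, 3, 2, 21        # placeholder design metrics for one component
ET, TMIO = 0.29, 0.12                   # placeholder execution time and message traffic

SC = NOP + NRI + NPI + CCS              # Eq. (3): static complexity
NSC = SC / 100                          # normalized static complexity
OCCOM = NSC + ET + TMIO                 # Eq. (4): overall complexity
print(SC, NSC, OCCOM)                   # 30 0.3 0.71
```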

Measuring Severity of Failures

The severity of failures of the components and messages is determined by FMEA. Here, the impact of soft errors (only soft errors that degrade system functionality are considered) in each component is analyzed by injecting transient faults, as described earlier. Table 4 shows the severity of failures of the components in the smartphone system. The severity of failure of the messages is calculated using FMEA for the following three scenarios: (a) Place Call Request Successful (Fig. 3), (b) Connection Management Place Call Request Success (Fig. 4), and (c) Network Connect (Fig. 5). The calculated severities of the failures of the messages in these three scenarios are indicated on the respective figures. Transient faults are injected at each message individually, and the consequences are then checked during execution and evaluation. As shown in these three scenarios, the severities of the failures of the messages that are linked with UI (User Interface) or NET (Network environment) are not derived: since the scope of this paper deals only with components’ criticalities (a component’s impact on other components due to soft errors), criticalities related to the external environment are not considered here. As shown in Fig. 3, Sev2 has the highest severity rank (equal to 10); the next ranking belongs to Sev6, which is from MMC to DMC. Other severity rankings are also shown in the figure. The message Sev2, in Fig. 4, has the highest severity among the communicating messages in this scenario. Figure 5 shows the severities of the failures of the messages in the Network Connect scenario. The severities of Sev1, Sev3, Sev4, and Sev5 are indicated without values since they involve communication with the external environment (analysis of interactions with the external environment is outside the scope of this paper).

Table 4 The severity of failures in the smartphone system
Fig. 3 Severity of the components and messages in the place call request successful scenario

Fig. 4 The severity of the components and messages in the connection management place call request success scenario

Fig. 5 The severity of the components and messages in the network connect scenario

Measuring Propagation of Failure

Propagations of failures from the components are calculated as explained in “Measuring the Propagation of Failure from the Components”. The values of propagation of failures of the components are shown in the second column of Table 5.

Table 5 Components’ propagation of failure and criticalities for the smartphone system

Measuring Criticalities of Components

The criticalities are calculated using (2) and presented in the third column of Table 5. It can be observed from this table that CMC is the most critical of the three components in the smartphone system.

Significance of Propagation of Failure in Measuring Criticalities of the Components in the Smartphone System

The comparison between criticalities, with and without considering the propagations of failures of the components, is shown in Fig. 6 to validate the significance of propagation of failure in measuring the criticalities of the components.

Fig. 6 The comparison between criticalities with and without considering the propagation of failure in the smartphone system

As shown in Fig. 6, if the propagation of failure is not considered, the criticalities obtained for MMC and DMC are almost equal, and the criticality of CMC does not vary much. However, considering the propagation of failure reveals large variations among their criticalities. Generally, a soft error in a component affects all of its connected components. Hence, a soft error in the first component (node) of a chain may have a greater effect on system functionality. CMC is generally a starting node, whereas DMC is not. Thus, the consequences of CMC and DMC cannot be equal if the probability of propagation of failure is considered. Our previous paper [12] failed to flag this consequence through a deep analysis. Thus, determining the propagation of failure in the smartphone system is effective for the criticality ranking of the components.

Lowering Criticalities of Components

We have analyzed the structural and behavioral models of all three components to consider whether there were provisions to revise their design. Our focus in this case was on reducing the complexity of a component, which in turn would reduce its criticality. We observed that if changes were applied to CMC or DMC, the functionality of the system was affected, whereas if changes were applied to a few activities of MMC, the functionality remained the same without breaking other constraints. Hence, the behavioral models of MMC were inspected in depth for refactoring to lower the complexity of the system. The state diagram of MMC, the source code inside each part, and all external parts and code were scrutinized so that the complexity of MMC could be lowered without affecting the system’s functionality. In the MMC Calling Control activity diagram, two internal states, developed to update location and to check signal, are merged; the Signal Checking state of the In Call activity diagram is detached, and the code inside its parts is merged with the code of the Voice-Data state of this activity. Due to the application of refactoring, the ET of CMC was reduced from 0.29 to 0.228 and the ET of MMC was reduced from 0.05 to 0.0393, while the ET of DMC remained the same.

The lower ET values reduce the complexities, and hence the criticalities, of CMC and MMC. Since these two components are the most critical in the smartphone system, the proposed method minimized their criticalities, which in turn minimizes the threats of soft errors in the system.

Conclusions

In this paper, a new systematic approach was investigated to minimize the risks of soft errors at the design level of embedded systems. A methodology for measuring the consequences of soft errors affecting a system’s components was presented. By ranking components based on their criticality, this paper suggests to designers where to revise the design to lower the criticalities of the components. Taking a more global view, the “low hanging fruit” of the reliability problem was captured: specifically, soft error problems at the design level of embedded systems. This helps to complement the later phases of the embedded design process.

This paper deals with the criticality of individual components within particular parts of the system. A future enhancement of this research could be to determine the criticality of a whole system, so as to judge the relative criticality of its components in the larger systems’ domain. New metrics would then be needed to determine the criticality of the whole system relative to other possible systems in a large system domain.

References

  1. Ghribi I, et al. R-codesign: codesign methodology for real-time reconfigurable embedded systems under energy constraints. IEEE Access. 2018;6:14078–92.
  2. Tan B, Biglari-Abhari M, Salcic Z. An automated security-aware approach for design of embedded systems on MPSoC. ACM Trans Embed Comput Syst. 2017;16(5s):1–20.
  3. Ahammed S, et al. Soft error tolerance using HVDQ (Horizontal-Vertical-Diagonal-Queen parity method). Comput Syst Sci Eng. 2017;32(1):35–47.
  4. Baumann R. Soft errors in commercial semiconductor technology: overview and scaling trends. In: IEEE 2002 reliability physics tutorial notes, reliability fundamentals, vol. 7; 2002.
  5. Katoen J-P. Quantitative evaluation in embedded system design: trends in modeling and analysis techniques. In: 2008 design, automation and test in Europe. IEEE; 2008.
  6. Van Harten LD, Mousavi M, Jordans R, Pourshaghaghi HR. Determining the necessity of fault tolerance techniques in FPGA devices for space missions. Microprocess Microsyst. 2018;63:1–10.
  7. Pratt B, Caffrey M, Graham P, Morgan K, Wirthlin M. Improving FPGA design robustness with partial TMR. In: 2006 IEEE international reliability physics symposium proceedings. IEEE; 2006. p. 226–32.
  8. Harten V, Khatri AR, Hayek A, Börcsök J. Validation of the proposed hardness analysis technique for FPGA designs to improve reliability and fault-tolerance. Int J Adv Comput Sci Appl. 2018;9(12):1–8.
  9. Gu C, Hanley N, O’Neill M. Improved reliability of FPGA-based PUF identification generator design. ACM Trans Reconfig Technol Syst. 2017;10(3):1–23.
  10. Anwer J, Platzner M. Evaluating fault-tolerance of redundant FPGA structures using Boolean difference calculus. Microprocess Microsyst. 2017;52:160–72.
  11. Majzik I, Pataricza A, Bondavalli A. Stochastic dependability analysis of system architecture based on UML models. Archit Depend Syst LNCS. 2003;2677:219–19.
  12. Sadi MS, Myers DG, Sanchez CO, Jurjens J. Component criticality analysis to minimizing soft errors risk. Comput Syst Sci Eng. 2010;26(1):377–91.
  13. Weulersse C, et al. Contribution of thermal neutrons to soft error rate. IEEE Trans Nucl Sci. 2018;65(8):1851–7.
  14. Jung D, Sharma A, Jung J. A review of soft errors and the low α-solder bumping process in 3-D packaging technology. J Mater Sci. 2018;53(1):47–65.
  15. Irom F, et al. Single-event upset in evolving commercial silicon-on-insulator microprocessor technologies. IEEE Trans Nucl Sci. 2003;50(6):2107–12.
  16. Baumann RC. Radiation-induced soft errors in advanced semiconductor technologies. IEEE Trans Device Mater Reliab. 2005;5(3):305–16.
  17. Mukherjee S, Emer J, Reinhardt SK. The soft error problem: an architectural perspective. In: 11th international symposium on high-performance computer architecture. IEEE; 2005.
  18. Park S, Li S, Mahlke S. Low cost transient fault protection using loop output prediction. In: 2018 48th annual IEEE/IFIP international conference on dependable systems and networks workshops (DSN-W). IEEE; 2018.
  19. Mukherjee SS, Kontz M, Reinhardt SK. Detailed design and evaluation of redundant multi-threading alternatives. In: Proceedings 29th annual international symposium on computer architecture. IEEE; 2002.
  20. Diehl S, et al. Error analysis and prevention of cosmic ion-induced soft errors in static CMOS RAMs. IEEE Trans Nucl Sci. 1982;29(6):2032–9.
  21. Liu MN. Low power SEU immune CMOS memory circuits. IEEE Trans Nucl Sci. 1992;39(6):1679–84.
  22. Calin T. Upset hardened memory design for submicron CMOS technology. IEEE Trans Nucl Sci. 1996;43(6):2874–8.
  23. Gomaa M, et al. Transient-fault recovery for chip multiprocessors. In: Proceedings of the 30th annual international symposium on computer architecture. IEEE; 2003.
  24. Srinivasan J, et al. The case for lifetime reliability-aware microprocessors. ACM SIGARCH Comput Archit News. 2004;32(2):276.
  25. Rashid MW, et al. Power-efficient error tolerance in chip multiprocessors. IEEE Micro. 2005;25(6):60–70.
  26. Bowles JB. An assessment of RPN prioritization in a failure modes effects and criticality analysis. In: Annual reliability and maintainability symposium. IEEE; 2003.
  27. Military Standard, US. Procedures for performing a failure mode, effects and criticality analysis. MIL-STD-1629A; 1980.
  28. Bowles JB. The new SAE FMECA standard. In: Annual reliability and maintainability symposium: international symposium on product quality and integrity. IEEE; 1998.
  29. Avizienis A, et al. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans Depend Secure Comput. 2004;1(1):11–33.
  30. Nguyen HT, et al. Chip-level soft error estimation method. IEEE Trans Device Mater Reliab. 2005;5(3):365–81.
  31. Yacoub SM, Ammar HH. A methodology for architecture-level reliability risk analysis. IEEE Trans Softw Eng. 2002;28(6):529–47.
  32. Wagner S, Jürjens J. Model-based identification of fault-prone components. In: European dependable computing conference. Springer; 2005.
  33. Hosseini SM, et al. Reprioritization of failures in a system failure mode and effects analysis by decision making trial and evaluation laboratory technique. Reliab Eng Syst Saf. 2006;91:872–81.


Author information

Correspondence to Muhammad Sheikh Sadi.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.


Cite this article

Sadi, M.S., Ahmed, W. & Jürjens, J. Towards Tolerating Soft Errors for Embedded Systems. SN COMPUT. SCI. 2, 101 (2021). https://doi.org/10.1007/s42979-021-00497-9

Keywords

  • Criticality analysis
  • Embedded systems
  • Heuristic metrics
  • Reliability risks
  • Soft errors