Abstract
Engineered electrical systems are increasingly important to modern civilization, so they must be designed to sustain ideal levels of performance despite the potential for internal faults and external attacks. Designing systems that exhibit resilience, also known as fault tolerance, is the primary method of preserving optimal performance under adverse conditions. This paper reviews a variety of computational and electromechanical fault tolerance techniques from the literature in order to evaluate the state of the art and identify potential areas for improvement. Our findings suggest that the existing literature has focused on only a limited number of resilience challenges, and that no single resilience-enhancing solution, whether hardware- or software-based, can address all of the major types of possible faults. Further, we classify the papers using the resilience matrix, which combines the four resilience phases put forth by the National Academy of Sciences with the four Network Centric Warfare domains. We identify the matrix components that are insufficiently addressed; in particular, we found no relevant literature on the cognitive and social domains. Even within the parts of the resilience matrix that have received attention in the literature to date, we observe relatively less emphasis on adapting computational and electromechanical systems so that a repeated fault does not incur significant disruption in subsequent occurrences. Based on this review, we find that although significant and sustained attention has been dedicated to enhancing the resilience of engineered electrical systems, substantial work remains before resilience challenges are addressed fully enough to instill confidence in our ability to engineer resilient systems.
Copyright information
© 2017 Springer Science+Business Media B.V.
About this paper
Cite this paper
Zussblatt, N.P., Ganin, A.A., Larkin, S., Fiondella, L., Linkov, I. (2017). Resilience and Fault Tolerance in Electrical Engineering. In: Linkov, I., Palma-Oliveira, J. (eds) Resilience and Risk. NATO Science for Peace and Security Series C: Environmental Security. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-1123-2_16
DOI: https://doi.org/10.1007/978-94-024-1123-2_16
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-024-1122-5
Online ISBN: 978-94-024-1123-2
eBook Packages: Computer Science, Computer Science (R0)