
Resilience and Fault Tolerance in Electrical Engineering

  • Conference paper
Resilience and Risk

Abstract

Because engineered electrical systems have become increasingly important to modern civilization, it is necessary to design systems that sustain ideal levels of performance despite the potential for internal faults and external attacks. Designing systems that exhibit resilience, also known as fault tolerance, is the primary method by which optimal performance is preserved under adverse conditions. This paper reviews a variety of computational and electromechanical fault tolerance techniques from the literature in order to evaluate the state of the art and identify potential areas for improvement. Our findings suggest that the existing literature has focused on only a limited number of resilience challenges, and that no single resilience-enhancing solution, whether hardware- or software-based, is capable of addressing all of the major types of possible faults. Further, we classify the papers using the resilience matrix, which combines four resilience phases put forth by the National Academy of Sciences with four Network Centric Warfare domains. We identify the matrix components that are insufficiently addressed; in particular, we found no relevant literature on the cognitive and social domains. Even within the parts of the resilience matrix that have received attention to date, we observe relatively little emphasis on adapting computational and electromechanical systems so that a repeated fault does not cause significant disruption in subsequent occurrences. Therefore, based on this review, we find that while significant and sustained attention has been dedicated to enhancing the resilience of engineered electrical systems, substantial work remains to fully address resilience challenges and to instill confidence in our ability to engineer resilient systems.
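
The classification step described in the abstract can be pictured as a 4 x 4 grid of cells, with the papers assigned to each cell and the empty cells reported as gaps. The following is a minimal sketch under stated assumptions, not the authors' procedure: the phase names (prepare, absorb, recover, adapt) and domain names (physical, information, cognitive, social) are taken from the general resilience-matrix literature rather than from the abstract itself, the function names are hypothetical, and the paper labels are placeholders rather than actual classifications from the review.

```python
from collections import defaultdict

# Assumed names: the abstract says "four phases" and "four domains" without listing them.
PHASES = ("prepare", "absorb", "recover", "adapt")
DOMAINS = ("physical", "information", "cognitive", "social")


def build_matrix(classified):
    """Map each (domain, phase) cell to the list of papers assigned to it."""
    matrix = defaultdict(list)
    for paper, domain, phase in classified:
        if domain not in DOMAINS or phase not in PHASES:
            raise ValueError(f"unknown cell: {domain}/{phase}")
        matrix[(domain, phase)].append(paper)
    return matrix


def uncovered_cells(matrix):
    """Return the cells with no assigned papers, in row-major order."""
    return [(d, p) for d in DOMAINS for p in PHASES if not matrix.get((d, p))]


# Hypothetical placeholder entries, not the paper's actual classification.
example = [
    ("paper_A", "physical", "absorb"),
    ("paper_B", "information", "recover"),
    ("paper_C", "physical", "absorb"),
]

matrix = build_matrix(example)
gaps = uncovered_cells(matrix)
```

With these placeholder entries, every cell in the cognitive and social rows appears in `gaps`, which is the kind of gap the abstract reports for those two domains.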



Author information

Corresponding author

Correspondence to Igor Linkov.

Copyright information

© 2017 Springer Science+Business Media B.V.

About this paper

Cite this paper

Zussblatt, N.P., Ganin, A.A., Larkin, S., Fiondella, L., Linkov, I. (2017). Resilience and Fault Tolerance in Electrical Engineering. In: Linkov, I., Palma-Oliveira, J. (eds) Resilience and Risk. NATO Science for Peace and Security Series C: Environmental Security. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-1123-2_16

  • DOI: https://doi.org/10.1007/978-94-024-1123-2_16

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-024-1122-5

  • Online ISBN: 978-94-024-1123-2

  • eBook Packages: Computer Science (R0)
