On-Demand Fault Tolerance on Massively Parallel Processor Arrays

  • Chapter
Invasive Tightly Coupled Processor Arrays

Part of the book series: Computer Architecture and Design Methodologies (CADM)

Abstract

In this chapter, we present for the first time (a) a systematic and holistic method for realising on-demand fault tolerance on Tightly Coupled Processor Arrays (TCPAs) rather than on single processors. We propose (b) different levels of replication for TCPAs, i.e., no replication, Dual Modular Redundancy (DMR), and Triple Modular Redundancy (TMR), each with different error-handling capabilities. A major contribution is to (c) apply these replication schemes based on our novel reliability calculus for each scheme and on environmental conditions such as the Soft Error Rates (SERs) monitored on the system. The strength of our reliability analysis is its use of application execution characteristics that we derive from the compilation process. This guides a system to transparently adopt suitable fault tolerance techniques according to application needs.
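To make the difference in error-handling capability between the three replication levels concrete, the following minimal Python sketch shows the textbook behaviour of each level when replica outputs are compared. It is an illustration only, not the chapter's hardware or compiler mechanism: the names `Status` and `resolve` are hypothetical, replica outputs are modelled as plain integers, and the comparator and voter are assumed to be fault-free.

```python
from collections import Counter
from enum import Enum
from typing import Optional, Sequence, Tuple


class Status(Enum):
    UNCHECKED = "unchecked"    # no replication: an error would pass unnoticed
    OK = "ok"                  # all replicas agree
    DETECTED = "detected"      # DMR mismatch: error detected, cannot be corrected
    CORRECTED = "corrected"    # TMR: a single deviating replica was outvoted
    UNRESOLVED = "unresolved"  # TMR: all three replicas disagree


def resolve(outputs: Sequence[int]) -> Tuple[Optional[int], Status]:
    """Combine the outputs of 1, 2, or 3 replicas (hypothetical helper)."""
    n = len(outputs)
    if n == 1:
        # No replication: the single result is forwarded without any check.
        return outputs[0], Status.UNCHECKED
    if n == 2:
        # DMR: a comparator can detect a mismatch but cannot tell which copy is wrong.
        if outputs[0] == outputs[1]:
            return outputs[0], Status.OK
        return None, Status.DETECTED
    if n == 3:
        # TMR: majority voting masks one deviating replica.
        value, votes = Counter(outputs).most_common(1)[0]
        if votes == 3:
            return value, Status.OK
        if votes == 2:
            return value, Status.CORRECTED
        return None, Status.UNRESOLVED
    raise ValueError("expected 1, 2, or 3 replica outputs")


if __name__ == "__main__":
    print(resolve([7]))
    print(resolve([7, 9]))
    print(resolve([7, 9, 7]))
```

Under these assumptions, the cost of each level grows with the number of replicas while the handling improves from none, to detection, to correction of a single faulty replica, which is why choosing the level on demand can be worthwhile.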

Notes

  1. The accuracy bound \(P(N_{f} > N_{f}^{max}, t) < 10^{-35}\) is smaller than the minimum probability of failure in our evaluations in Sect. 4.6.2.

Author information

Corresponding author

Correspondence to Vahid Lari.

Copyright information

© 2016 Springer Science+Business Media Singapore

About this chapter

Cite this chapter

Lari, V. (2016). On-Demand Fault Tolerance on Massively Parallel Processor Arrays. In: Invasive Tightly Coupled Processor Arrays. Computer Architecture and Design Methodologies. Springer, Singapore. https://doi.org/10.1007/978-981-10-1058-3_4

  • DOI: https://doi.org/10.1007/978-981-10-1058-3_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-1057-6

  • Online ISBN: 978-981-10-1058-3

  • eBook Packages: Engineering, Engineering (R0)
