Abstract
In this chapter, we present for the first time (a) a systematic and holistic method to realise on-demand fault tolerance support on Tightly Coupled Processor Arrays (TCPAs) rather than on single processors. We propose (b) different levels of replication, i.e., no replication, Dual Modular Redundancy (DMR), and Triple Modular Redundancy (TMR), each with different error-handling capabilities for TCPAs. A major contribution is to (c) select among these replication schemes based on our novel reliability calculus for each scheme and on environmental conditions such as the Soft Error Rate (SER) monitored on the system. The strength of our reliability analysis is its use of application execution characteristics that we derive from the compilation process. This guides a system to transparently adopt suitable fault tolerance techniques according to application needs.
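To make the different error-handling capabilities of the three replication levels concrete, the following is a minimal software sketch, not the chapter's hardware mechanism. It assumes the textbook behaviour of each scheme: no replication leaves results unchecked, DMR can only detect a mismatch, and TMR can mask a single faulty replica by majority vote. The names `Replication` and `resolve` are hypothetical and chosen for illustration only.

```python
from collections import Counter
from enum import Enum


class Replication(Enum):
    """Hypothetical labels for the three replication levels in the abstract."""
    NONE = 1  # single copy, no error handling
    DMR = 2   # two copies, errors detected by comparison
    TMR = 3   # three copies, a single error masked by majority vote


def resolve(outputs, level):
    """Return (value, status) for the replica outputs of one computation.

    Assumption: outputs are hashable values (e.g. ints) produced by
    replicas that ran the same computation.
    """
    if len(outputs) != level.value:
        raise ValueError("expected %d replica outputs" % level.value)

    if level is Replication.NONE:
        # Nothing to compare against: an error would go unnoticed.
        return outputs[0], "unchecked"

    if level is Replication.DMR:
        # Two copies can reveal a mismatch but not which copy is wrong.
        if outputs[0] == outputs[1]:
            return outputs[0], "ok"
        return None, "error detected"

    # TMR: majority vote over three copies.
    value, votes = Counter(outputs).most_common(1)[0]
    if votes == 3:
        return value, "ok"
    if votes == 2:
        return value, "error corrected"
    # All three replicas disagree: no majority exists.
    return None, "error detected"
```

For example, `resolve([7, 9, 7], Replication.TMR)` masks the deviating replica and returns the majority value, whereas `resolve([7, 9], Replication.DMR)` can only report that an error occurred. This difference in capability, against the cost of the extra replicas, is the trade-off that an on-demand selection among the schemes has to weigh.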
Notes
1. An accuracy of \(P(N_{f} > N_{f}^{max}, t) < 10^{-35}\) is smaller than the minimum probability of failure in our evaluations in Sect. 4.6.2.
© 2016 Springer Science+Business Media Singapore
About this chapter
Cite this chapter
Lari, V. (2016). On-Demand Fault Tolerance on Massively Parallel Processor Arrays. In: Invasive Tightly Coupled Processor Arrays. Computer Architecture and Design Methodologies. Springer, Singapore. https://doi.org/10.1007/978-981-10-1058-3_4
DOI: https://doi.org/10.1007/978-981-10-1058-3_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-1057-6
Online ISBN: 978-981-10-1058-3
eBook Packages: Engineering (R0)