Improving Reliability of Multi-/Many-Core Processors by Using NMR-MPar Approach

  • Vanessa Vargas
  • Pablo Ramos
  • Jean-Francois Méhaut
  • Raoul Velazco


The current trend in computing systems is to build solutions on multi-core and many-core processors. COTS processors are preferred because they offer high performance with low power consumption at an affordable price. Lately, these devices have been used in High Performance Computing systems because of their massive parallelism and low power budget. Over the last decade, industrial and academic partners have worked together to overcome dependability issues and extend the use of these devices to embedded systems. Despite multiple proposals for improving multi-core reliability, their use is not yet validated for critical tasks. This chapter describes a new fault-tolerance approach called NMR-MPar, which is based on N-Modular Redundancy and M-Partitions, to improve the reliability of applications running on these devices. The effectiveness of the NMR-MPar approach was evaluated on two complementary benchmark applications running on the 28 nm CMOS MPPA-256 many-core processor, and the results show that the approach can be considered for mixed-criticality systems. Finally, this chapter analyses the overhead of the approach in terms of power consumption and energy.



This work was supported in part by the Universidad de las Fuerzas Armadas ESPE and by the Secretaria de Educación Superior, Ciencia, Tecnología e Innovación del Ecuador (SENESCYT) through grant PIC-2017-EXT-004 and the STIC-AmSud (Science et Technologie de l’Information et de la Communication en Amérique du Sud) project Energy-aware Scheduling and Fault Tolerance Techniques for the Exascale Era (EnergySFE), PIC-16-ESPE-STIC-001, and by the French authorities through the “Investissements d’Avenir” program (CAPACITES project). The authors thank Stéphane Gailhard from Kalray for his valuable contribution to solving the MPPA programming issues.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Vanessa Vargas (1)
  • Pablo Ramos (1)
  • Jean-Francois Méhaut (2)
  • Raoul Velazco (3)
  1. Universidad de las Fuerzas Armadas ESPE, Sangolqui, Ecuador
  2. Université de Grenoble Alpes, Saint Martin d’Heres, France
  3. Centre National de la Recherche Scientifique (CNRS), Grenoble, France
