Design Automation for Embedded Systems, Volume 16, Issue 4, pp 189–220

A framework for reliability-aware design exploration on MPSoC based systems

  • Jia Huang
  • Andreas Raabe
  • Kai Huang
  • Christian Buckl
  • Alois Knoll

Abstract

Applying system-level fault-tolerance techniques such as active redundancy is a promising way to enhance system reliability for safety-related applications. Embedded system design using active redundancy is a challenging task that involves solving two major problems: finding the optimal redundancy configuration, and mapping and scheduling the application (including the redundant components) onto the platform under timing and reliability constraints. This paper presents a framework for automatic synthesis of fault-tolerant designs on multiprocessor platforms. The core of the framework consists of: (1) a reliability analysis that computes the system-level reliability in the presence of spatial and temporal redundancy, and (2) an optimization approach for reliability-aware design space exploration. The proposed approach considers both transient and permanent faults and is among the first to support system design using imperfect fault detectors. The framework takes an application model, a platform model and a set of application requirements as input, and generates the recommended design parameters, including the task-to-processor binding, the task schedule and the selection and placement of redundancy. The effectiveness of our approach is illustrated using several case studies.

Keywords

Reliability, Fault-tolerance, Design exploration, Real-time systems

Notes

Acknowledgements

This work has been supported in part by the European research project ACROSS under the Grant Agreement ARTEMIS-2009-1-100208 and the German BMBF projects ECU (grant number: 13N11936) and Car2X (grant number: 13N11933).

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Jia Huang (1)
  • Andreas Raabe (1)
  • Kai Huang (2)
  • Christian Buckl (1)
  • Alois Knoll (2)
  1. fortiss GmbH, Munich, Germany
  2. TU München, Garching bei Munich, Germany
