Real-Time Systems, Volume 55, Issue 4, pp 889–924

Practical task allocation for software fault-tolerance and its implementation in embedded automotive systems

  • Anand Bhat
  • Soheil Samii
  • Ragunathan Rajkumar


Due to the advent of active safety features and automated driving capabilities, the complexity of embedded computing systems within automobiles continues to increase. Such advanced driver assistance systems (ADAS) are inherently safety-critical and must tolerate failures in any subsystem. However, fault-tolerance in safety-critical systems has traditionally been supported by hardware replication, which is prohibitively expensive in terms of cost, weight, and size for the automotive market. Recent work has studied software-based fault-tolerance techniques that use task-level hot and cold standbys to tolerate fail-stop processor and task failures. The benefit of using standbys is maximal when a task and its standbys obey the placement constraint that no two of them are co-located on the same processor. We propose a new heuristic based on a “tiered” placement constraint, and show that our heuristic produces a better task assignment that saves at least one processor up to 40% of the time relative to the best known heuristic to date. We then introduce a task allocation algorithm that, for the first time to our knowledge, leverages the run-time attributes of cold standbys. Our empirical study finds that our heuristic uses no more than one additional processor in most cases relative to an optimal allocation that we construct for evaluation purposes. We also extend our heuristic to support mixed-criticality systems, which allow for overload operation. We have designed and implemented our software fault-tolerance framework in AUTOSAR, an automotive industry standard. We use this implementation to provide an experimental evaluation of our task-level fault-tolerance features. Finally, we present an analysis of the worst-case behavior of our task recovery features.
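The placement constraint mentioned above can be illustrated with a minimal sketch. It is not the tiered heuristic proposed in the paper: it is a plain first-fit-decreasing bin-packing pass, chosen here only for illustration, that refuses to place a task on a processor already hosting another member of the same replica group. The names `Task`, `Processor`, `group`, and `first_fit_decreasing`, and the per-processor utilization bound of 1.0, are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    utilization: float  # fraction of one processor the task needs
    group: str          # a primary and its standbys share one group id (assumed model)


@dataclass
class Processor:
    capacity: float = 1.0  # assumed utilization bound per processor
    tasks: list = field(default_factory=list)

    def load(self) -> float:
        return sum(t.utilization for t in self.tasks)

    def can_host(self, task: Task) -> bool:
        # Capacity check, with a small tolerance for floating-point sums.
        fits = self.load() + task.utilization <= self.capacity + 1e-9
        # Placement constraint: no other member of the same replica group here.
        separate = all(t.group != task.group for t in self.tasks)
        return fits and separate


def first_fit_decreasing(tasks, capacity=1.0):
    """Assign tasks to processors, largest utilization first."""
    processors = []
    for task in sorted(tasks, key=lambda t: t.utilization, reverse=True):
        if task.utilization > capacity:
            raise ValueError(f"{task.name} does not fit on any processor")
        for proc in processors:
            if proc.can_host(task):
                proc.tasks.append(task)
                break
        else:
            # No existing processor can take the task: open a new one.
            proc = Processor(capacity)
            proc.tasks.append(task)
            processors.append(proc)
    return processors
```

Calling `first_fit_decreasing` on a list of `Task` objects returns the processors it opened; each processor's `tasks` list then contains at most one member of any replica group. The sketch treats hot and cold standbys identically, so it does not capture the run-time attributes of cold standbys that the abstract refers to.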


Keywords: Real-time systems, Automotive systems, Fault tolerance, Task allocation




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA
  2. General Motors R&D, Warren, USA
  3. Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, USA
