Reliability-Aware Fault-Tolerant Scheduling

  • Guoqi Xie
  • Gang Zeng
  • Renfa Li
  • Keqin Li


Reliability is widely recognized as an increasingly important issue in heterogeneous distributed cloud systems, because processor failures degrade the quality of service delivered to users. Replication-based fault tolerance is a common way to satisfy an application's reliability requirement. This chapter addresses the problem of minimizing redundancy while satisfying the reliability requirement of a parallel application on heterogeneous distributed cloud systems. It also considers heterogeneous distributed embedded systems such as automotive cyber-physical systems (ACPS), which are safety-critical and for which response time is a second safety attribute. The chapter therefore further addresses cost optimization under a safety requirement that comprises both reliability and response time requirements on heterogeneous distributed embedded systems such as ACPS. We first propose the enough replication for redundancy minimization (ERRM) algorithm to satisfy an application's reliability requirement, and then the heuristic replication for redundancy minimization (HRRM) algorithm, which satisfies the same requirement with low time complexity. ERRM generates the least redundancy, followed by HRRM and then the state-of-the-art MaxRe and RR algorithms, while HRRM achieves approximately minimum redundancy with a short computation time. Because a minimum number of replicas does not necessarily yield the minimum execution cost or the shortest schedule length in heterogeneous distributed cloud systems, we further propose the quantitative fault-tolerance with minimum execution cost (QFEC and QFEC+) algorithms and the quantitative fault-tolerance with minimum schedule length (QFSL and QFSL+) algorithms, all of which satisfy the reliability requirement of the workflow.
Next, we present a safety-aware fault-tolerant methodology for resource cost optimization in end-to-end functional safety computation on ACPS. The proposed design methodology involves early functional safety requirement verification and late resource cost design optimization. For the early design phase, we propose the functional safety requirement verification (FSRV) algorithm, which verifies the functional safety requirement (reliability and response time requirements) of a distributed automotive function. For the late design phase, we propose the resource cost-aware fault-tolerant optimization (RCFO) algorithm, which reduces resource cost while still satisfying the functional safety requirement of the function.
Finally, this chapter presents experiments for different application environments such as CPCS and ACPS. We first evaluate redundancy optimization on real and randomly generated parallel applications at different scales and degrees of parallelism to validate the performance of ERRM and HRRM on heterogeneous distributed systems. We then evaluate execution cost and schedule length optimization on heterogeneous distributed cloud systems to validate the efficiency of QFEC, QFEC+, QFSL, and QFSL+. Lastly, we evaluate resource cost optimization with real-life and synthetic automotive applications on heterogeneous distributed embedded systems to validate the performance and efficiency of RCFO and FSRV.
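
The link between replication and a reliability requirement mentioned above can be illustrated with a small sketch. It is not the ERRM or HRRM algorithm from this chapter. It only assumes the common model in which replicas of one task fail independently, so the task succeeds if at least one replica succeeds. The function names and the per-processor reliability values are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def replicated_reliability(replica_reliabilities: List[float]) -> float:
    """Probability that at least one replica succeeds.

    Assumption: replica failures are independent.
    """
    failure = 1.0
    for r in replica_reliabilities:
        failure *= (1.0 - r)
    return 1.0 - failure


def fewest_replicas(candidates: List[float], target: float) -> Tuple[List[float], float]:
    """Pick the most reliable processors first until the target is met.

    `candidates` holds the (hypothetical) reliability of one task on each
    available processor. For a single task under the independence
    assumption, taking processors in decreasing order of reliability
    gives the smallest replica count that reaches `target`.
    Raises ValueError if the target cannot be reached at all.
    """
    chosen: List[float] = []
    for r in sorted(candidates, reverse=True):
        chosen.append(r)
        achieved = replicated_reliability(chosen)
        if achieved >= target:
            return chosen, achieved
    raise ValueError("target reliability unreachable with the given processors")


if __name__ == "__main__":
    # Hypothetical reliabilities of one task on three processors.
    replicas, achieved = fewest_replicas([0.90, 0.95, 0.80], target=0.99)
    print(len(replicas), round(achieved, 6))
```

With these illustrative numbers a single replica is not enough to reach 0.99, while two replicas on the two most reliable processors are. The chapter's algorithms deal with the harder case of a whole parallel application, where the requirement applies to the product of all task reliabilities.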


Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  • Guoqi Xie (1)
  • Gang Zeng (2)
  • Renfa Li (3)
  • Keqin Li (4)

  1. College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
  2. Graduate School of Engineering, Nagoya University, Nagoya, Japan
  3. Key Laboratory for Embedded and Cyber-Physical Systems of Hunan Province, Hunan University, Changsha, China
  4. Department of Computer Science, State University of New York, New Paltz, USA
