
Abstract

Reliability is an increasingly important concern in heterogeneous distributed cloud systems because processor failures degrade the quality of service delivered to users. Replication-based fault tolerance is a common way to satisfy an application’s reliability requirement. This chapter addresses the problem of minimizing redundancy while satisfying the reliability requirement of a parallel application on heterogeneous distributed cloud systems. It also considers heterogeneous distributed embedded systems such as automotive cyber-physical systems (ACPS), which are safety-critical and for which response time is a further safety attribute. The chapter therefore also addresses cost optimization under a safety requirement that comprises both reliability and response time requirements on heterogeneous distributed embedded systems such as ACPS. We first propose the enough replication for redundancy minimization (ERRM) algorithm to satisfy an application’s reliability requirement, and then propose the heuristic replication for redundancy minimization (HRRM) algorithm to satisfy that requirement with low time complexity. ERRM generates the least redundancy, followed by HRRM and then the state-of-the-art MaxRe and RR algorithms; HRRM achieves approximately minimum redundancy with a short computation time. Because a minimum number of replicas does not necessarily yield the minimum execution cost or the shortest schedule length in a heterogeneous distributed cloud system, we further propose the quantitative fault-tolerance with minimum execution cost (QFEC and QFEC+) algorithms and the quantitative fault-tolerance with minimum schedule length (QFSL and QFSL+) algorithms, all of which satisfy the reliability requirement of the workflow. Next, we present a safety-aware fault-tolerant methodology for resource cost optimization in end-to-end functional safety computation on ACPS.
The proposed design methodology involves early functional safety requirement verification and late resource cost design optimization. We first propose the functional safety requirement verification (FSRV) algorithm, which verifies the functional safety requirement, consisting of reliability and response time requirements, of a distributed automotive function in the early design phase. We then propose the resource cost-aware fault-tolerant optimization (RCFO) algorithm, which reduces resource cost while satisfying the function’s functional safety requirement in the late design phase. Finally, this chapter presents experiments for different application environments such as CPCS and ACPS. We first evaluate redundancy optimization on real and randomly generated parallel applications of different scales and degrees of parallelism to validate the performance of ERRM and HRRM on heterogeneous distributed systems. We then evaluate execution cost and schedule length optimization on heterogeneous distributed cloud systems to validate the efficiency of QFEC, QFEC+, QFSL, and QFSL+. Finally, we evaluate resource cost optimization with real-life and synthetic automotive applications on heterogeneous distributed embedded systems to validate the performance and efficiency of RCFO and FSRV.
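
As a rough illustration of replication-based fault tolerance, the sketch below adds replicas of a single task until a reliability target is met. It is not the chapter's ERRM or HRRM algorithm, whose details the abstract does not give. It assumes that replica failures are independent, so a task succeeds if at least one replica succeeds. The function names, the example reliabilities, and the 0.99 target are all hypothetical.

```python
from math import prod


def task_reliability(replica_reliabilities):
    """Probability that at least one replica succeeds.

    Assumption (not from the chapter): replicas fail independently.
    """
    return 1.0 - prod(1.0 - r for r in replica_reliabilities)


def greedy_replicas(processor_reliabilities, target):
    """Add replicas on the most reliable processors first until the target is met.

    Returns the reliabilities of the chosen replicas, or None if the target
    cannot be reached even when every processor hosts a replica.
    """
    chosen = []
    for r in sorted(processor_reliabilities, reverse=True):
        chosen.append(r)
        if task_reliability(chosen) >= target:
            return chosen
    return None


# Hypothetical per-processor reliabilities for one task.
replicas = greedy_replicas([0.90, 0.95, 0.80], target=0.99)
print(replicas, task_reliability(replicas))
```

In this example one replica on the best processor is not enough, so the sketch places a second replica and stops there. Taking processors in descending order of reliability keeps the replica count minimal for a single task under the independence assumption. A whole application has many dependent tasks sharing one reliability requirement, which is the harder problem the chapter addresses.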




Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Xie, G., Zeng, G., Li, R., Li, K. (2019). Reliability-Aware Fault-Tolerant Scheduling. In: Scheduling Parallel Applications on Heterogeneous Distributed Systems. Springer, Singapore. https://doi.org/10.1007/978-981-13-6557-7_3


  • DOI: https://doi.org/10.1007/978-981-13-6557-7_3


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-6556-0

  • Online ISBN: 978-981-13-6557-7

  • eBook Packages: Engineering, Engineering (R0)
