Reliability-Aware Fault-Tolerant Scheduling

Xie, Guoqi; Zeng, Gang; Li, Renfa; Li, Keqin

doi:10.1007/978-981-13-6557-7_3

Abstract

Reliability is widely recognized as an increasingly important issue in heterogeneous distributed cloud systems because processor failures affect the quality of service delivered to users. Replication-based fault tolerance is a common approach to satisfying an application’s reliability requirement. This chapter addresses the problem of minimizing redundancy while satisfying the reliability requirement of a parallel application on heterogeneous distributed cloud systems. It also considers heterogeneous distributed embedded systems such as automotive cyber-physical systems (ACPS), which are safety-critical and for which response time is another safety attribute. The chapter therefore further addresses cost optimization under a safety requirement that comprises both reliability and response time requirements on heterogeneous distributed embedded systems such as ACPS. We first propose the enough replication for redundancy minimization (ERRM) algorithm to satisfy an application’s reliability requirement, and then propose the heuristic replication for redundancy minimization (HRRM) algorithm to satisfy the same requirement with low time complexity. ERRM generates the least redundancy, followed by HRRM and then by the state-of-the-art MaxRe and RR algorithms. HRRM achieves approximately minimum redundancy with a short computation time. Because a minimum number of replicas does not necessarily lead to the minimum execution cost or the shortest schedule length in a heterogeneous distributed cloud system, we further propose the quantitative fault-tolerance with minimum execution cost (QFEC and QFEC+) algorithms and the quantitative fault-tolerance with minimum schedule length (QFSL and QFSL+) algorithms, all of which satisfy the reliability requirement of the workflow. Next, we present a safety-aware fault-tolerant methodology for resource cost optimization in end-to-end functional safety computation on ACPS.
The proposed design methodology involves early functional safety requirement verification and late resource cost design optimization. For the early design phase, we propose the functional safety requirement verification (FSRV) algorithm, which verifies the functional safety requirement (consisting of the reliability and response time requirements) of a distributed automotive function. For the late design phase, we propose the resource cost-aware fault-tolerant optimization (RCFO) algorithm, which reduces the resource cost while satisfying the functional safety requirement of the function. Finally, this chapter presents experiments for different application environments such as CPCS and ACPS. We first evaluate redundancy optimization on real and randomly generated parallel applications at different scales and degrees of parallelism to validate the performance of ERRM and HRRM on heterogeneous distributed systems. We then evaluate execution cost and schedule length optimization on heterogeneous distributed cloud systems to validate the efficiency of QFEC, QFEC+, QFSL and QFSL+. Finally, we evaluate resource cost optimization with real-life and synthetic automotive applications on heterogeneous distributed embedded systems to validate the performance and efficiency of RCFO and FSRV.
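
To make the idea of replication-based fault tolerance concrete, the following is a minimal sketch and not the chapter’s ERRM or HRRM algorithm. It assumes a failure model that the abstract does not state: each processor fails according to a Poisson process with a constant rate, so one replica succeeds with probability exp(-rate × execution time), and a replicated task succeeds if at least one replica succeeds. The function names and the greedy selection order are hypothetical and serve only as illustration.

```python
import math


def replica_reliability(failure_rate, exec_time):
    """Reliability of one replica under the ASSUMED Poisson failure model."""
    return math.exp(-failure_rate * exec_time)


def task_reliability(replica_reliabilities):
    """A replicated task succeeds if at least one of its replicas succeeds."""
    all_fail = 1.0
    for r in replica_reliabilities:
        all_fail *= 1.0 - r
    return 1.0 - all_fail


def greedy_replicas(failure_rates, exec_times, target):
    """Illustrative greedy choice (hypothetical, not from the chapter).

    Adds replicas on processors in order of decreasing replica reliability
    until the task reliability reaches `target` or processors run out.
    Returns (chosen processor indices, achieved reliability).
    """
    rel = [replica_reliability(l, w) for l, w in zip(failure_rates, exec_times)]
    order = sorted(range(len(rel)), key=lambda i: rel[i], reverse=True)
    chosen = []
    for i in order:
        chosen.append(i)
        achieved = task_reliability([rel[j] for j in chosen])
        if achieved >= target:
            return chosen, achieved
    # Target not reachable: report what all available processors achieve.
    return chosen, task_reliability([rel[j] for j in chosen])


if __name__ == "__main__":
    # Three hypothetical processors with different failure rates.
    chosen, achieved = greedy_replicas(
        failure_rates=[0.001, 0.003, 0.0005],
        exec_times=[10.0, 10.0, 10.0],
        target=0.9999,
    )
    print(chosen, achieved)
```

In this toy setting the number of replicas grows only until the target is met, which reflects the trade-off the abstract describes: fewer replicas mean less redundancy, but the choice of processors also affects execution cost and schedule length.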

Author information

Authors and Affiliations

College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
Guoqi Xie
Graduate School of Engineering, Nagoya University, Nagoya, Aichi, Japan
Gang Zeng
Key Laboratory for Embedded and Cyber-Physical Systems of Hunan Province, Hunan University, Changsha, Hunan, China
Renfa Li
Department of Computer Science, State University of New York, New Paltz, NY, USA
Keqin Li

About this chapter

Cite this chapter

Xie, G., Zeng, G., Li, R., Li, K. (2019). Reliability-Aware Fault-Tolerant Scheduling. In: Scheduling Parallel Applications on Heterogeneous Distributed Systems. Springer, Singapore. https://doi.org/10.1007/978-981-13-6557-7_3

DOI: https://doi.org/10.1007/978-981-13-6557-7_3
Published: 07 August 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6556-0
Online ISBN: 978-981-13-6557-7
eBook Packages: Engineering, Engineering (R0)