A Metaheuristic for Optimizing the Performance and the Fairness in Job Scheduling Systems

Artificial Intelligence Applications in Information and Communication Technologies

Part of the book series: Studies in Computational Intelligence (SCI, volume 607)

Abstract

Over the past two decades, many studies have addressed efficient resource management and job scheduling in large computational systems such as HPC clusters and Grids. Many of these works propose Artificial Intelligence-based methods such as metaheuristics. This chapter provides an overview of such works and discusses why mainstream resource management and scheduling systems nevertheless use only a limited set of rather simple scheduling policies. We identify several reasons for this situation, e.g., the common use of oversimplified problem definitions with rather naive job and machine models, or the application of unrealistic optimization criteria. To address these issues, this chapter proposes new, complex, and well-designed approaches built around a metaheuristic that periodically optimizes the job scheduling plan using several optimization criteria based on real-life requirements. Importantly, the approaches described in this chapter are successfully used in practice, i.e., within a production job scheduler that manages the computing infrastructure of the Czech Centre for Education, Research and Innovation in ICT (CERIT Scientific Cloud).

Notes

  1. To avoid huge slowdowns of extremely short jobs, the minimal job runtime is bounded by some predefined time constant (e.g., 10 s), sometimes called a “threshold of interactivity” [10].

  2. If a job is not the first in the queue, new jobs that arrive later may skip it in the queue. While such jobs do not delay the first job in the queue, they may delay all other jobs, and the system cannot predict when a queued job will eventually run [4].

  3. When required, the schedule can be recreated from scratch, e.g., due to a machine failure or early job completion as discussed in Sect. 3.2.1. Still, no optimization or evaluation is applied during this process.

  4. Equalizing normalized users’ wait times is an analogy to the well-known fair-share mechanism [17], which is commonly applied in production systems (see Sect. 3.3).

  5. http://www.metacentrum.cz.

  6. https://github.com/aleasimulator.

  7. Except for CERIT-SC, all workloads come from the Parallel Workloads Archive [47]. The CERIT-SC workload can be obtained at http://www.fi.muni.cz/~xklusac/workload/.

  8. Other policies, such as Conservative backfilling using linear compression with RS optimization disabled (CONS-L), First Come First Served (FCFS) [12], and Shortest Job First (SJF) [16], were also tested, but they performed poorly compared to the other algorithms. We therefore omit them from the figures for readability.

  9. http://www.cerit-sc.cz.
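
The runtime bound described in Note 1 can be illustrated with the bounded slowdown metric as it is commonly formulated in the parallel job scheduling literature. The sketch below is an assumption about how the bound is applied, not the chapter's own definition: the function name `bounded_slowdown`, its parameter names, and the use of seconds are hypothetical, and only the 10 s threshold comes from the note.

```python
def bounded_slowdown(wait_time, runtime, threshold=10.0):
    """Return the bounded slowdown of one job (all times in seconds).

    Assumed formulation: response time divided by the runtime, where the
    runtime in the denominator is never taken to be smaller than
    `threshold` (the "threshold of interactivity"), and the result is
    never smaller than 1.
    """
    if wait_time < 0 or runtime < 0:
        raise ValueError("wait_time and runtime must be non-negative")
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    response_time = wait_time + runtime
    return max(1.0, response_time / max(runtime, threshold))


if __name__ == "__main__":
    # A 1 s job that waited 100 s: the plain slowdown would be 101,
    # while the bounded variant divides by the 10 s threshold instead.
    print(bounded_slowdown(wait_time=100, runtime=1))
    # A 100 s job that waited 100 s is unaffected by the threshold.
    print(bounded_slowdown(wait_time=100, runtime=100))
```

Under this formulation, a very short job that waits a moderate time no longer dominates an averaged slowdown figure, which is the effect the note describes.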

References

  1. Kleban, S.D., Clearwater, S.H.: Fair share on high performance computing systems: What does fair really mean? In: Third IEEE International Symposium on Cluster Computing and the Grid (CCGrid’03), pp. 146–153. IEEE (2003)

  2. Klusáček, D., Rudová, H.: Performance and fairness for users in parallel job scheduling. In: Cirne, W. (ed.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 7698, pp. 235–252. Springer (2012)

  3. Hovestadt, M., Kao, O., Keller, A., Streit, A.: Scheduling in HPC resource management systems: queuing vs. planning. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 2862, pp. 1–20. Springer (2003)

  4. Mu’alem, A.W., Feitelson, D.G.: Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel Distrib. Syst. 12(6), 529–543 (2001)

  5. Xhafa, F., Abraham, A.: Metaheuristics for Scheduling in Distributed Computing Environments. Studies in Computational Intelligence, vol. 146. Springer, Berlin (2008)

  6. Klusáček, D., Tóth, Š.: On interactions among scheduling policies: finding efficient queue setup using high-resolution simulations. In: Silva, F., Dutra, I., Costa, V.S. (eds.) Euro-Par 2014. LNCS, vol. 8632, pp. 138–149. Springer (2014)

  7. Adaptive Computing Enterprises, Inc.: Moab Workload Manager, Jan 2015. http://docs.adaptivecomputing.com/

  8. Klusáček, D.: Event-based optimization of schedules for grid jobs. Ph.D. thesis, Masaryk University, 2011

  9. Klusáček, D., Rudová, H.: Efficient grid scheduling through the incremental schedule-based approach. Comput. Intell.: Int. J. 27(1), 4–22 (2011)

  10. Feitelson, D.G., Rudolph, L., Schwiegelshohn, U., Sevcik, K.C., Wong, P.: Theory and practice in parallel job scheduling. In: Feitelson, D.G., Rudolph, L. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 1291, pp. 1–34. Springer (1997)

  11. Tsafrir, D., Etsion, Y., Feitelson, D.G.: Modeling user runtime estimates. In: Feitelson, D.G., Frachtenberg, E., Rudolph, L., Schwiegelshohn, U. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 3834, pp. 1–35. Springer (2005)

  12. PBS Works: PBS Professional 12.1, Administrator’s Guide, Jan 2015. http://www.pbsworks.com

  13. Ernemann, C., Hamscher, V., Yahyapour, R.: Benefits of global grid computing for job scheduling. In: GRID’04: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, pp. 374–379. IEEE (2004)

  14. Sabin, G., Kochhar, G., Sadayappan, P.: Job fairness in non-preemptive job scheduling. In: International Conference on Parallel Processing (ICPP’04), pp. 186–194. IEEE Computer Society (2004)

  15. Sabin, G., Sadayappan, P.: Unfairness metrics for space-sharing parallel job schedulers. In: Feitelson, D.G., Frachtenberg, E., Rudolph, L., Schwiegelshohn, U. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 3834, pp. 238–256. Springer (2005)

  16. Srinivasan, S., Kettimuthu, R., Subramani, V., Sadayappan, P.: Selective reservation strategies for backfill job scheduling. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 2537, pp. 55–71. Springer (2002)

  17. Jackson, D., Snell, Q., Clement, M.: Core algorithms of the Maui scheduler. In: Feitelson, D.G., Rudolph, L. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 2221, pp. 87–102. Springer (2001)

  18. Adaptive Computing Enterprises, Inc.: TORQUE Resource Manager, Jan 2015. http://docs.adaptivecomputing.com/

  19. Lifka, D.A.: The ANL/IBM SP scheduling system. In: Feitelson, D.G., Rudolph, L. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 949, pp. 295–303. Springer (1995)

  20. Talby, D., Feitelson, D.G.: Supporting priorities and improving utilization of the IBM SP scheduler using slack-based backfilling. In: IPPS’99/SPDP’99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pp. 513–517. IEEE Computer Society (1999)

  21. Feitelson, D.G.: Experimental analysis of the root causes of performance evaluation results: a backfilling case study. IEEE Trans. Parallel Distrib. Syst. 16(2), 175–182 (2005)

  22. Li, B., Zhao, D.: Performance impact of advance reservations from the grid on backfill algorithms. In: Sixth International Conference on Grid and Cooperative Computing (GCC 2007), pp. 456–461 (2007)

  23. Ngubiri, J.: Techniques and evaluation of processor co-allocation in multi-cluster systems. Ph.D. thesis, Radboud University Nijmegen, 2008

  24. Feitelson, D.G., Weil, A.M.: Utilization and predictability in scheduling the IBM SP2 with backfilling. In: 12th International Parallel Processing Symposium, pp. 542–546. IEEE (1998)

  25. Chiang, S.-H., Arpaci-Dusseau, A., Vernon, M.K.: The impact of more accurate requested runtimes on production job scheduling performance. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 2537, pp. 103–127. Springer (2002)

  26. Smith, W., Taylor, V., Foster, I.: Using run-time predictions to estimate queue wait times and improve scheduler performance. In: Feitelson, D.G., Rudolph, L. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 1659, pp. 202–219. Springer (1999)

  27. Srinivasan, S., Kettimuthu, R., Subramani, V., Sadayappan, P.: Characterization of backfilling strategies for parallel job scheduling. In: Proceedings of 2002 International Workshops on Parallel Processing, pp. 514–519. IEEE Computer Society (2002)

  28. Yousif, A., Abdullah, A.H., Nor, S.M., Abdelaziz, A.A.: Scheduling jobs on grid computing using firefly algorithm. J. Theor. Appl. Inf. Technol. 33(2), 155–164 (2011)

  29. Abraham, A., Liu, H., Grosan, C., Xhafa, F.: Nature inspired meta-heuristics for grid scheduling: single and multi-objective optimization approaches. In: Metaheuristics for Scheduling in Distributed Computing Environments [5], pp. 247–272 (2008)

  30. Abramson, D., Buyya, R., Murshed, M., Venugopal, S.: Scheduling parameter sweep applications on global grids: a deadline and budget constrained cost-time optimisation algorithm. Softw.: Pract. Exper. 35(5), 491–512 (2005)

  31. Stucky, K.-U., Jakob, W., Quinte, A., Süß, W.: Solving scheduling problems in grid resource management using an evolutionary algorithm. In: On the Move to Meaningful Internet Systems 2006: CoopIS, DOA, GADA, and ODBASE. LNCS, vol. 4276, pp. 1252–1262. Springer (2006)

  32. Kumar, R., Vadhiyar, S.: Prediction of queue waiting times for metascheduling on parallel batch systems. In: Cirne, W. (ed.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 8828. Springer (2015)

  33. Nurmi, D., Brevik, J., Wolski, R.: QBETS: queue bounds estimation from time series. In: Frachtenberg, E., Schwiegelshohn, U. (eds.) Job Scheduling Strategies for Parallel Processing. LNCS, vol. 4942, pp. 76–101. Springer (2007)

  34. Klusáček, D., Chlumský, V., Rudová, H.: Optimizing user oriented job scheduling within TORQUE. In: SuperComputing—The International Conference for High Performance Computing, Networking, Storage and Analysis. Poster, 2013

  35. Keller, A., Reinefeld, A.: Anatomy of a resource management system for HPC clusters. Annu. Rev. Scalable Comput. 3, 1–31 (2001)

  36. Subrata, R., Zomaya, A.Y., Landfeldt, B.: Artificial life techniques for load balancing in computational grids. J. Comput. Syst. Sci. 73(8), 1176–1190 (2007)

  37. Ritchie, G., Levine, J.: A fast, effective local search for scheduling independent jobs in heterogeneous computing environments. In: Porteous, J. (ed.) 22nd Workshop of the UK Planning and Scheduling Special Interest Group (PlanSig 03), 2003

  38. Carretero, J., Xhafa, F.: Using genetic algorithms for scheduling jobs in large scale grid applications. J. Technol. Econ. Dev. Res. J. Vilnius Gediminas Tech. Univ. 12(1), 11–17 (2006)

  39. YarKhan, A., Dongarra, J.J.: Experiments with scheduling using simulated annealing in a grid environment. In: Parashar, M. (ed.) GRID. LNCS, vol. 2536. Springer (2002)

  40. Kołodziej, J., Xhafa, F.: Integration of task abortion and security requirements in GA-based meta-heuristics for independent batch grid scheduling. Comput. Math. Appl. 63(2), 350–364 (2012)

  41. Switalski, P., Seredynski, F.: Scheduling parallel batch jobs in grids with evolutionary metaheuristics. J. Sched. 1–13 (2014)

  42. Pooranian, Z., Shojafar, M., Abawajy, J., Abraham, A.: An efficient meta-heuristic algorithm for grid computing. J. Comb. Optim. 1–22 (2013)

  43. Xhafa, F., Abraham, A.: Computational models and heuristic methods for grid scheduling problems. Future Gener. Comput. Syst. 26(4), 608–621 (2010)

  44. Süß, W., Jakob, W., Quinte, A., Stucky, K.-U.: GORBA: a global optimising resource broker embedded in a Grid resource management system. In: International Conference on Parallel and Distributed Computing Systems, PDCS 2005, pp. 19–24. IASTED/ACTA Press (2005)

  45. Jakob, W., Quinte, A., Stucky, K.-U., Süß, W.: Optimised scheduling of Grid resources using hybrid evolutionary algorithms. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Wasniewski, J. (eds.) Parallel Processing and Applied Mathematics, 6th International Conference, PPAM 2005. LNCS, vol. 3911, pp. 406–413. Springer (2005)

  46. Sulistio, A., Cibej, U., Venugopal, S., Robic, B., Buyya, R.: A toolkit for modelling and simulating data grids: an extension to GridSim. Concurr. Comput.: Pract. Exper. 20(13), 1591–1609 (2008)

  47. Feitelson, D.G.: Parallel workloads archive (PWA), Jan 2015. http://www.cs.huji.ac.il/labs/parallel/workload/

Acknowledgments

We highly appreciate the support of the Grant Agency of the Czech Republic under the grant No. P202/12/0306. The access to the MetaCentrum computing facilities and workloads provided under the program “Projects of Large Infrastructure for Research, Development, and Innovations” LM2010005 funded by the Ministry of Education, Youth, and Sports of the Czech Republic is highly appreciated.

Author information

Corresponding author

Correspondence to Dalibor Klusáček.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Klusáček, D., Rudová, H. (2015). A Metaheuristic for Optimizing the Performance and the Fairness in Job Scheduling Systems. In: Laalaoui, Y., Bouguila, N. (eds) Artificial Intelligence Applications in Information and Communication Technologies. Studies in Computational Intelligence, vol 607. Springer, Cham. https://doi.org/10.1007/978-3-319-19833-0_1

  • DOI: https://doi.org/10.1007/978-3-319-19833-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19832-3

  • Online ISBN: 978-3-319-19833-0

  • eBook Packages: Engineering, Engineering (R0)
