Planning and Metaheuristic Optimization in Production Job Scheduler

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2015, JSSPP 2016)

Abstract

In this work we present our positive experience with a unique advanced job scheduler that we have developed for the widely used TORQUE Resource Manager. Unlike common schedulers that use a queuing approach and simple heuristics, our solution uses planning (job schedule construction) and schedule optimization by a local search-inspired metaheuristic. Using both complex simulations and practical deployment in a real system, we show that this approach increases predictability, performance and fairness with respect to a common queue-based scheduler. The presented scheduler has been successfully used in the production infrastructure of the Czech Centre for Education, Research and Innovation in ICT (CERIT Scientific Cloud) since July 2014.

Notes

  1. https://www.openccs.eu/core/

  2. Our system uses equal weights (\(w=1\)) for wait time and bounded slowdown, while the normalized user wait time (fairness) has a ten times higher weight (\(w=10\)).

  3. http://metavo.metacentrum.cz/pbsmon2/

  4. These two periods were chosen because the physical infrastructure was identical during that time. Since January 2015, the system has been larger (4,512 CPUs vs. 5,216 CPUs), which would skew any direct comparison of system performance.

  5. Because these experiments were small and had a short makespan, the workload contained only a few distinct users, most of them with just a few jobs. This made the fairness-related criterion impractical and inconclusive in this case.

  6. This workload is available at: http://www.fi.muni.cz/~xklusac/workload/

  7. The formula is: \(acceptable\_wait = (\ln (req\_CPUs)+1)\cdot (walltime/factor)\). Here \(req\_CPUs\) denotes the number of requested CPUs, and the job's walltime is divided by an integer (\(factor \ge 1\)) which increases as the walltime increases, emulating the user's non-linear wait time expectations. Currently, we use five factors 1, 2, ..., 5, which apply to walltimes <3h, 3h..7h, 7h..24h, 1d..7d, \({\ge }1w\), respectively.
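
The acceptable wait formula from note 7 can be illustrated with a minimal Python sketch. It is not taken from the scheduler's source code: the function names, the use of seconds as the time unit, and the treatment of each lower interval bound as inclusive (for example, a walltime of exactly 3h gets factor 2) are assumptions made here for illustration.

```python
import math

# Time units in seconds (the unit is an assumption; the note does not state one).
HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY


def walltime_factor(walltime_s: float) -> int:
    """Return the divisor (1..5) for a walltime, following the five
    intervals <3h, 3h..7h, 7h..24h, 1d..7d, >=1w listed in note 7.
    Lower bounds are treated as inclusive, which is an assumption."""
    if walltime_s < 3 * HOUR:
        return 1
    if walltime_s < 7 * HOUR:
        return 2
    if walltime_s < DAY:
        return 3
    if walltime_s < WEEK:
        return 4
    return 5


def acceptable_wait(req_cpus: int, walltime_s: float) -> float:
    """acceptable_wait = (ln(req_CPUs) + 1) * (walltime / factor)."""
    if req_cpus < 1:
        raise ValueError("req_cpus must be at least 1")
    return (math.log(req_cpus) + 1) * (walltime_s / walltime_factor(walltime_s))


if __name__ == "__main__":
    # A single-CPU job: ln(1) = 0, so the acceptable wait is walltime / factor.
    print(acceptable_wait(1, 1 * HOUR))   # 1h job, factor 1
    print(acceptable_wait(1, 4 * HOUR))   # 4h job, factor 2
    print(acceptable_wait(16, 2 * DAY))   # 16-CPU, 2-day job, factor 4
```

For a single-CPU job the logarithmic term vanishes, so a 1-hour job tolerates a 1-hour wait while a 4-hour job tolerates only 2 hours; requesting more CPUs raises the tolerated wait logarithmically.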

References

  1. Adaptive Computing Enterprises, Inc., Torque 6.0.0 Administrator Guide, February 2016. http://docs.adaptivecomputing.com

  2. Alea simulator, February 2016. https://github.com/aleasimulator

  3. CERIT Scientific Cloud, February 2016. http://www.cerit-sc.cz

  4. Feitelson, D.G., Rudolph, L., Schwiegelshohn, U., Sevcik, K.C., Wong, P.: Theory and practice in parallel job scheduling. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1997. LNCS, vol. 1291, pp. 1–34. Springer, Heidelberg (1997). doi:10.1007/3-540-63574-2_14

  5. Hovestadt, M., Kao, O., Keller, A., Streit, A.: Scheduling in HPC resource management systems: queuing vs. planning. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 1–20. Springer, Heidelberg (2003). doi:10.1007/10968987_1

  6. Jackson, D., Snell, Q., Clement, M.: Core algorithms of the Maui scheduler. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 2001. LNCS, vol. 2221, pp. 87–102. Springer, Heidelberg (2001). doi:10.1007/3-540-45540-X_6

  7. Keller, A., Reinefeld, A.: Anatomy of a resource management system for HPC clusters. Annu. Rev. Scalable Comput. 3, 1–31 (2001)

  8. Klusáček, D., Chlumský, V., Rudová, H.: Planning and optimization in TORQUE resource manager. In: 24th ACM International Symposium on High Performance Distributed Computing (HPDC), pp. 203–206. ACM (2015)

  9. Klusáček, D., Rudová, H.: Performance and fairness for users in parallel job scheduling. In: Cirne, W., Desai, N., Frachtenberg, E., Schwiegelshohn, U. (eds.) JSSPP 2012. LNCS, vol. 7698, pp. 235–252. Springer, Heidelberg (2013). doi:10.1007/978-3-642-35867-8_13

  10. Klusáček, D., Rudová, H.: A metaheuristic for optimizing the performance and the fairness in job scheduling systems. In: Laalaoui, Y., Bouguila, N. (eds.) Artificial Intelligence Applications in Information and Communication Technologies. SCI, vol. 607, pp. 3–29. Springer, Cham (2015). doi:10.1007/978-3-319-19833-0_1

  11. Kołodziej, J., Xhafa, F.: Integration of task abortion and security requirements in GA-based meta-heuristics for independent batch grid scheduling. Comput. Math. Appl. 63(2), 350–364 (2012)

  12. Li, B., Zhao, D.: Performance impact of advance reservations from the Grid on backfill algorithms. In: Sixth International Conference on Grid and Cooperative Computing (GCC 2007), pp. 456–461 (2007)

  13. Mu’alem, A.W., Feitelson, D.G.: Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel Distrib. Syst. 12(6), 529–543 (2001)

  14. PBS Works. PBS Professional 13.0, Administrator’s Guide, February 2016. http://www.pbsworks.com

  15. Pooranian, Z., Shojafar, M., Abawajy, J., Abraham, A.: An efficient meta-heuristic algorithm for grid computing. J. Comb. Optim. 30(3), 413–434 (2015)

  16. Stucky, K.-U., Jakob, W., Quinte, A., Süß, W.: Solving scheduling problems in grid resource management using an evolutionary algorithm. In: Meersman, R., Tari, Z. (eds.) OTM 2006. LNCS, vol. 4276, pp. 1252–1262. Springer, Heidelberg (2006). doi:10.1007/11914952_14

  17. Süß, W., Jakob, W., Quinte, A., Stucky, K.-U.: GORBA: a global optimising resource broker embedded in a Grid resource management system. In: International Conference on Parallel and Distributed Computing Systems, PDCS 2005, pp. 19–24. IASTED/ACTA Press (2005)

  18. Switalski, P., Seredynski, F.: Scheduling parallel batch jobs in grids with evolutionary metaheuristics. J. Sched. 18(4), 345–357 (2015)

  19. Tsafrir, D., Etsion, Y., Feitelson, D.G.: Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans. Parallel Distrib. Syst. 18(6), 789–803 (2007)

  20. Xhafa, F., Abraham, A.: Metaheuristics for Scheduling in Distributed Computing Environments. SCI, vol. 146. Springer, Heidelberg (2008)

  21. Zakay, N., Feitelson, D.G.: Preserving user behavior characteristics in trace-based simulation of parallel job scheduling. In: 22nd Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 51–60 (2014)

  22. Zakay, N., Feitelson, D.G.: Semi-open trace based simulation for reliable evaluation of job throughput and user productivity. In: 7th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2015), pp. 413–421. IEEE (2015)

Acknowledgments

We kindly acknowledge the support and computational resources provided by MetaCentrum under the program LM2015042 and by the CERIT Scientific Cloud under the program LM2015085, provided under the programme “Projects of Large Infrastructure for Research, Development, and Innovations”. We also highly appreciate the access to CERIT Scientific Cloud workload traces. Last but not least, we thank Dror Feitelson for his kind help and explanations concerning the dynamic workload model presented in [21].

Corresponding author

Correspondence to Dalibor Klusáček.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Klusáček, D., Chlumský, V. (2017). Planning and Metaheuristic Optimization in Production Job Scheduler. In: Desai, N., Cirne, W. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2015, JSSPP 2016. Lecture Notes in Computer Science, vol 10353. Springer, Cham. https://doi.org/10.1007/978-3-319-61756-5_11

  • DOI: https://doi.org/10.1007/978-3-319-61756-5_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61755-8

  • Online ISBN: 978-3-319-61756-5

  • eBook Packages: Computer Science (R0)
