Abstract
In this work we present our positive experience with a unique advanced job scheduler that we have developed for the widely used TORQUE Resource Manager. Unlike common schedulers that use a queuing approach and simple heuristics, our solution uses planning (job schedule construction) and schedule optimization by a local search-inspired metaheuristic. Using both complex simulations and practical deployment in a real system, we show that this approach improves predictability, performance and fairness compared with a common queue-based scheduler. The presented scheduler has been used successfully in the production infrastructure of the Czech Centre for Education, Research and Innovation in ICT (CERIT Scientific Cloud) since July 2014.
Notes
- 1.
- 2. Our system uses equal weights (\(w=1\)) for wait time and bounded slowdown while the normalized user wait time (fairness) has ten times higher weight (\(w=10\)).
- 3.
- 4. Those two periods were chosen because the physical infrastructure was identical during that time. Since January 2015, the system has been larger (4,512 CPUs vs. 5,216 CPUs), which would skew any direct comparison of system performance.
- 5. The small size and short makespan of these experiments meant that there were few distinct users in the workload (most of them with just a few jobs), making the fairness-related criterion impractical and inconclusive in this case.
- 6. This workload is available at: http://www.fi.muni.cz/~xklusac/workload/.
- 7. The formula is: \(acceptable\_wait = (\ln (req\_CPUs)+1)\cdot (walltime/factor)\), where \(req\_CPUs\) denotes the number of requested CPUs and the job's walltime is divided by an integer (\(factor \ge 1\)) that increases with the walltime, emulating users' non-linear wait time expectations. Currently, we use five factors 1, 2, ..., 5, which apply to walltimes <3h, 3h..7h, 7h..24h, 1d..7d, and \({\ge }1w\), respectively.
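The acceptable wait formula in the last note can be illustrated with a short Python sketch. This is not the scheduler's actual code: the function names, the use of seconds as the time unit, and the treatment of the interval boundaries as half-open (for example, a walltime of exactly 7h falls under factor 3) are assumptions made here for illustration; the note itself only fixes the <3h and \({\ge }1w\) ends.

```python
import math

HOUR = 3600          # seconds; the time unit is an assumption of this sketch
DAY = 24 * HOUR

# (exclusive upper bound on walltime in seconds, factor).
# Half-open intervals are assumed: <3h, [3h,7h), [7h,24h), [1d,7d), >=1w.
_FACTOR_BOUNDS = [
    (3 * HOUR, 1),
    (7 * HOUR, 2),
    (DAY, 3),
    (7 * DAY, 4),
]


def walltime_factor(walltime_s):
    """Return the integer factor (1..5) for a walltime given in seconds."""
    for upper_bound, factor in _FACTOR_BOUNDS:
        if walltime_s < upper_bound:
            return factor
    return 5  # walltime of one week or more


def acceptable_wait(req_cpus, walltime_s):
    """acceptable_wait = (ln(req_CPUs) + 1) * (walltime / factor), in seconds."""
    if req_cpus < 1:
        raise ValueError("req_cpus must be at least 1")
    return (math.log(req_cpus) + 1) * (walltime_s / walltime_factor(walltime_s))


if __name__ == "__main__":
    # A single-CPU job with a 1-hour walltime: ln(1) = 0 and factor = 1.
    print(acceptable_wait(1, 1 * HOUR))
    # An 8-CPU job with a 1-day walltime falls under factor 4.
    print(acceptable_wait(8, DAY))
```

Because the factor grows with the walltime, the acceptable wait grows more slowly than the walltime itself, while the logarithmic term gives jobs requesting more CPUs a moderately longer acceptable wait.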
References
Adaptive Computing Enterprises, Inc., Torque 6.0.0 Administrator Guide, February 2016. http://docs.adaptivecomputing.com
Alea simulator, February 2016. https://github.com/aleasimulator
CERIT Scientific Cloud, February 2016. http://www.cerit-sc.cz
Feitelson, D.G., Rudolph, L., Schwiegelshohn, U., Sevcik, K.C., Wong, P.: Theory and practice in parallel job scheduling. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1997. LNCS, vol. 1291, pp. 1–34. Springer, Heidelberg (1997). doi:10.1007/3-540-63574-2_14
Hovestadt, M., Kao, O., Keller, A., Streit, A.: Scheduling in HPC resource management systems: queuing vs. planning. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 1–20. Springer, Heidelberg (2003). doi:10.1007/10968987_1
Jackson, D., Snell, Q., Clement, M.: Core algorithms of the Maui scheduler. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 2001. LNCS, vol. 2221, pp. 87–102. Springer, Heidelberg (2001). doi:10.1007/3-540-45540-X_6
Keller, A., Reinefeld, A.: Anatomy of a resource management system for HPC clusters. Annu. Rev. Scalable Comput. 3, 1–31 (2001)
Klusáček, D., Chlumský, V., Rudová, H.: Planning and optimization in TORQUE resource manager. In: 24th ACM International Symposium on High Performance Distributed Computing (HPDC), pp. 203–206. ACM (2015)
Klusáček, D., Rudová, H.: Performance and fairness for users in parallel job scheduling. In: Cirne, W., Desai, N., Frachtenberg, E., Schwiegelshohn, U. (eds.) JSSPP 2012. LNCS, vol. 7698, pp. 235–252. Springer, Heidelberg (2013). doi:10.1007/978-3-642-35867-8_13
Klusáček, D., Rudová, H.: A metaheuristic for optimizing the performance and the fairness in job scheduling systems. In: Laalaoui, Y., Bouguila, N. (eds.) Artificial Intelligence Applications in Information and Communication Technologies. SCI, vol. 607, pp. 3–29. Springer, Cham (2015). doi:10.1007/978-3-319-19833-0_1
Kołodziej, J., Xhafa, F.: Integration of task abortion and security requirements in GA-based meta-heuristics for independent batch grid scheduling. Comput. Math. Appl. 63(2), 350–364 (2012)
Li, B., Zhao, D.: Performance impact of advance reservations from the Grid on backfill algorithms. In: Sixth International Conference on Grid and Cooperative Computing (GCC 2007), pp. 456–461 (2007)
Mu’alem, A.W., Feitelson, D.G.: Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel Distrib. Syst. 12(6), 529–543 (2001)
PBS Works. PBS Professional 13.0, Administrator’s Guide, February 2016. http://www.pbsworks.com
Pooranian, Z., Shojafar, M., Abawajy, J., Abraham, A.: An efficient meta-heuristic algorithm for grid computing. J. Comb. Optim. 30(3), 413–434 (2015)
Stucky, K.-U., Jakob, W., Quinte, A., Süß, W.: Solving scheduling problems in grid resource management using an evolutionary algorithm. In: Meersman, R., Tari, Z. (eds.) OTM 2006. LNCS, vol. 4276, pp. 1252–1262. Springer, Heidelberg (2006). doi:10.1007/11914952_14
Süß, W., Jakob, W., Quinte, A., Stucky, K.-U.: GORBA: a global optimising resource broker embedded in a Grid resource management system. In: International Conference on Parallel and Distributed Computing Systems, PDCS 2005, pp. 19–24. IASTED/ACTA Press (2005)
Switalski, P., Seredynski, F.: Scheduling parallel batch jobs in grids with evolutionary metaheuristics. J. Sched. 18(4), 345–357 (2015)
Tsafrir, D., Etsion, Y., Feitelson, D.G.: Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans. Parallel Distrib. Syst. 18(6), 789–803 (2007)
Xhafa, F., Abraham, A.: Metaheuristics for Scheduling in Distributed Computing Environments. SCI, vol. 146. Springer, Heidelberg (2008)
Zakay, N., Feitelson, D.G.: Preserving user behavior characteristics in trace-based simulation of parallel job scheduling. In: 22nd Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 51–60 (2014)
Zakay, N., Feitelson, D.G.: Semi-open trace based simulation for reliable evaluation of job throughput and user productivity. In: 7th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2015), pp. 413–421. IEEE (2015)
Acknowledgments
We kindly acknowledge the support and computational resources provided by the MetaCentrum under the program LM2015042 and the CERIT Scientific Cloud under the program LM2015085, provided under the programme “Projects of Large Infrastructure for Research, Development, and Innovations”. We also highly appreciate the access to CERIT Scientific Cloud workload traces. Last but not least, we thank Dror Feitelson for his kind help and explanation concerning the dynamic workload model presented in [21].
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Klusáček, D., Chlumský, V. (2017). Planning and Metaheuristic Optimization in Production Job Scheduler. In: Desai, N., Cirne, W. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2015, JSSPP 2016. Lecture Notes in Computer Science, vol 10353. Springer, Cham. https://doi.org/10.1007/978-3-319-61756-5_11
DOI: https://doi.org/10.1007/978-3-319-61756-5_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61755-8
Online ISBN: 978-3-319-61756-5
eBook Packages: Computer Science, Computer Science (R0)