Tuning EASY-Backfilling Queues

Lelong, Jérôme; Reis, Valentin; Trystram, Denis

doi:10.1007/978-3-319-77398-8_3

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10773)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

EASY-Backfilling is a popular scheduling heuristic for allocating jobs in large-scale High Performance Computing platforms. While its aggressive reservation mechanism is fast and prevents job starvation, it does not try to optimize any scheduling objective per se. In this work, we consider the problem of tuning EASY using queue reordering policies. More precisely, we propose to tune the reordering using a simulation-based methodology: for a given system, we choose the policy that minimizes the average waiting time. This methodology departs from the First-Come, First-Serve rule and introduces a risk of larger maximum waiting times, which we control using a queue thresholding mechanism. This new approach is evaluated through a comprehensive experimental campaign on five production logs. In particular, we show that the behavior of the systems under study is stable enough to learn a heuristic that generalizes in a train/test fashion. Indeed, the average waiting time can be reduced consistently (by between 11% and 42% for the logs used) compared to EASY, with almost no increase in maximum waiting times. This work departs from previous learning-based approaches and shows that scheduling heuristics for HPC can be learned directly in a policy space.
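
The following Python fragment is a minimal sketch of the two ideas named in the abstract, queue reordering with a waiting-time threshold and simulation-based policy selection. It is not the authors' implementation. The `Job` fields, the three candidate policies (`FCFS`, `SPF`, `SAF`), the rule that jobs waiting longer than the threshold move to the head of the queue in submission order, and the caller-supplied `simulate` function are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Job:
    job_id: int
    submit_time: float     # seconds since the start of the log
    requested_time: float  # user-provided runtime estimate, in seconds
    cores: int


# Hypothetical candidate policies: each maps a job to a sort key,
# and jobs with smaller keys are considered first.
POLICIES: Dict[str, Callable[[Job], float]] = {
    "FCFS": lambda job: job.submit_time,                   # first come, first served
    "SPF": lambda job: job.requested_time,                 # shortest requested time first
    "SAF": lambda job: job.requested_time * job.cores,     # smallest requested area first
}


def reorder_queue(queue: Sequence[Job], now: float,
                  policy: str, threshold: float) -> List[Job]:
    """Return the queue in the order the scheduler would consider it.

    Assumed thresholding rule: jobs that have waited longer than
    `threshold` seconds go to the head of the queue in submission order;
    the remaining jobs are sorted by the chosen policy.
    """
    key = POLICIES[policy]
    overdue = [j for j in queue if now - j.submit_time > threshold]
    regular = [j for j in queue if now - j.submit_time <= threshold]
    overdue.sort(key=lambda j: j.submit_time)
    regular.sort(key=key)
    return overdue + regular


def select_policy(train_log: Sequence[Job],
                  simulate: Callable[[Sequence[Job], str, float], float],
                  threshold: float) -> str:
    """Pick the policy with the lowest simulated average waiting time.

    `simulate(log, policy, threshold)` is a caller-supplied scheduling
    simulator that returns an average waiting time; it is not defined here.
    """
    return min(POLICIES, key=lambda name: simulate(train_log, name, threshold))


queue = [
    Job(1, submit_time=0.0, requested_time=7200.0, cores=64),
    Job(2, submit_time=50.0, requested_time=600.0, cores=8),
    Job(3, submit_time=90.0, requested_time=60.0, cores=128),
]
# With a generous threshold the policy alone decides the order.
print([j.job_id for j in reorder_queue(queue, 100.0, "SPF", 3600.0)])
# With a tight threshold the long-waiting job 1 is promoted to the head.
print([j.job_id for j in reorder_queue(queue, 100.0, "SPF", 80.0)])
```

In a train/test setting, `select_policy` would be run on the training part of a log and the chosen policy then evaluated on the held-out part; the threshold bounds how long reordering can delay any single job.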

Notes

1. See the Parallel Workloads Archive [14] for details.
2. Note that this is also valid for the more refined Average Bounded Slowdown [13] metric.

References

1. PBS Pro 13.0 administrator’s guide. http://www.pbsworks.com/pdfs/PBSAdminGuide13.0.pdf
2. SLURM online documentation. http://slurm.schedmd.com/sched_config.html
3. TOP500 online ranking. https://www.top500.org/
4. Ahn, D.H., Garlick, J., Grondona, M., Lipari, D., Springmeyer, B., Schulz, M.: Flux: a next-generation resource management framework for large HPC centers. In: 2014 43rd International Conference on Parallel Processing Workshops, pp. 9–17, September 2014
5. Aida, K.: Effect of job size characteristics on job scheduling performance. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 2000. LNCS, vol. 1911, pp. 1–17. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-39997-6_1. http://dl.acm.org/citation.cfm?id=646381.689680
6. Breck, E.: zymake: a computational workflow system for machine learning and natural language processing. In: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 5–13. Association for Computational Linguistics (2008)
7. Capit, N., Da Costa, G., Georgiou, Y., Huard, G., Martin, C., Mounié, G., Neyron, P., Richard, O.: A batch scheduler with high level components. In: IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2005, vol. 2, pp. 776–783. IEEE (2005)
8. Casanova, H., Giersch, A., Legrand, A., Quinson, M., Suter, F.: Versatile, scalable, and accurate simulation of distributed applications and platforms. J. Parallel Distrib. Comput. 74(10), 2899–2917 (2014). http://hal.inria.fr/hal-01017319
9. Chiang, S.-H., Arpaci-Dusseau, A., Vernon, M.K.: The impact of more accurate requested runtimes on production job scheduling performance. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2002. LNCS, vol. 2537, pp. 103–127. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36180-4_7
10. DOE ASCAC Report: Synergistic challenges in data-intensive science and exascale computing (2013)
11. Dolstra, E., Visser, E., de Jonge, M.: Imposing a memory management discipline on software deployment. In: Proceedings of the 26th International Conference on Software Engineering, ICSE 2004, pp. 583–592. IEEE (2004)
12. Feitelson, D.G.: Resampling with feedback - a new paradigm of using workload data for performance evaluation. In: Dutot, P.-F., Trystram, D. (eds.) Euro-Par 2016. LNCS, vol. 9833, pp. 3–21. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-43659-3_1
13. Feitelson, D.G., Rudolph, L.: Metrics and benchmarking for parallel job scheduling. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1998. LNCS, vol. 1459, pp. 1–24. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0053978
14. Feitelson, D.G., Tsafrir, D., Krakov, D.: Experience with using the parallel workloads archive. J. Parallel Distrib. Comput. 74(10), 2967–2982 (2014). http://www.sciencedirect.com/science/article/pii/S0743731514001154
15. Frachtenberg, E., Feitelson, D.G.: Pitfalls in parallel job scheduling evaluation. In: Feitelson, D., Frachtenberg, E., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2005. LNCS, vol. 3834, pp. 257–282. Springer, Heidelberg (2005). https://doi.org/10.1007/11605300_13
16. Gaussier, E., Glesser, D., Reis, V., Trystram, D.: Improving backfilling by using machine learning to predict running times. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, pp. 64:1–64:10. ACM, New York (2015)
17. Jackson, D., Snell, Q., Clement, M.: Core algorithms of the Maui scheduler. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 2001. LNCS, vol. 2221, pp. 87–102. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-45540-X_6
18. Joachims, T.: Optimizing search engines using clickthrough data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142. ACM (2002)
19. Leung, J.Y.: Handbook of Scheduling: Algorithms, Models, and Performance Analysis. CRC Press, Boca Raton (2004)
20. Lifka, D.A.: The ANL/IBM SP scheduling system. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1995. LNCS, vol. 949, pp. 295–303. Springer, Heidelberg (1995). https://doi.org/10.1007/3-540-60153-8_35
21. Mu’alem, A.W., Feitelson, D.G.: Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel Distrib. Syst. 12(6), 529–543 (2001). https://doi.org/10.1109/71.932708
22. Nissimov, A., Feitelson, D.G.: Probabilistic backfilling. In: Frachtenberg, E., Schwiegelshohn, U. (eds.) JSSPP 2007. LNCS, vol. 4942, pp. 102–115. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78699-3_6
23. Perkovic, D., Keleher, P.J.: Randomization, speculation, and adaptation in batch schedulers. In: 2000 ACM/IEEE Conference on Supercomputing, p. 7, November 2000
24. Skovira, J., Chan, W., Zhou, H., Lifka, D.: The EASY - LoadLeveler API project. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1996. LNCS, vol. 1162, pp. 41–47. Springer, Heidelberg (1996). https://doi.org/10.1007/BFb0022286. http://dl.acm.org/citation.cfm?id=646377.689506
25. Srinivasan, S., Kettimuthu, R., Subramani, V., Sadayappan, P.: Characterization of backfilling strategies for parallel job scheduling. In: Proceedings of the International Conference on Parallel Processing Workshops, pp. 514–519. IEEE (2002)
26. Stodden, V., Leisch, F., Peng, R.D.: Implementing Reproducible Research. CRC Press, Boca Raton (2014)
27. Streit, A.: The self-tuning dynP job-scheduler. In: Abstracts and CD-ROM Proceedings of International Parallel and Distributed Processing Symposium, IPDPS 2002, April 2002
28. Tsafrir, D., Feitelson, D.G.: Instability in parallel job scheduling simulation: the role of workload flurries. In: Proceedings 20th IEEE International Parallel Distributed Processing Symposium, 10 pp., April 2006
29. Tsafrir, D., Etsion, Y., Feitelson, D.G.: Backfilling using runtime predictions rather than user estimates. Technical report TR 5, School of Computer Science and Engineering, Hebrew University of Jerusalem (2005)
30. Ukidave, Y., Li, X., Kaeli, D.: Mystic: predictive scheduling for GPU based cloud servers using machine learning. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 353–362, May 2016
31. Vishnu, A., van Dam, H., Tallent, N.R., Kerbyson, D.J., Hoisie, A.: Fault modeling of extreme scale applications using machine learning. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 222–231, May 2016

Acknowledgements

Authors are listed in alphabetical order. We warmly thank Eric Gaussier and Frederic Wagner for discussions as well as Pierre Neyron and Bruno Breznik for their invaluable help with experiments. We gratefully thank the contributors of the Parallel Workloads Archive, Victor Hazlewood (SDSC SP2), Travis Earheart and Nancy Wilkins-Diehr (SDSC Blue), Lars Malinowsky (KTH SP2), Dan Dwyer and Steve Hotovy (CTC SP2), Joseph Emeras (CEA Curie), and of course Dror Feitelson. This work has been partially supported by the LabEx PERSYVAL-Lab (ANR-11-LABX-0025-01) funded by the French program Investissements d’avenir. Experiments presented in this paper were carried out using the Digitalis platform (http://digitalis.imag.fr) of the Grid’5000 testbed. Grid’5000 is supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (https://www.grid5000.fr).

Author information

Authors and Affiliations

Univ. Grenoble Alpes, CNRS, Inria, LIG, LJK, Grenoble, France
Jérôme Lelong, Valentin Reis & Denis Trystram

Corresponding author

Correspondence to Valentin Reis.

Editor information

Editors and Affiliations

CESNET, Prague, Czech Republic
Dalibor Klusáček
Google, Mountain View, California, USA
Walfredo Cirne
Google, Seattle, Washington, USA
Narayan Desai

Cite this paper

Lelong, J., Reis, V., Trystram, D. (2018). Tuning EASY-Backfilling Queues. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2017. Lecture Notes in Computer Science, vol 10773. Springer, Cham. https://doi.org/10.1007/978-3-319-77398-8_3

DOI: https://doi.org/10.1007/978-3-319-77398-8_3
Published: 28 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77397-1
Online ISBN: 978-3-319-77398-8
eBook Packages: Computer Science, Computer Science (R0)
