Evaluating the Impact of Soft Walltimes on Job Scheduling Performance

Klusáček, Dalibor; Chlumský, Václav

doi:10.1007/978-3-030-10632-4_2

Conference paper
First Online: 13 January 2019

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11332)

Abstract

For two decades, researchers have been analyzing the impact of inaccurate job walltime (runtime) estimates on the performance of job scheduling algorithms, especially in the case of backfilling. Several studies analyzed the pros and cons of using accurate vs. inaccurate estimates. Some researchers focused on how users of the system can be motivated to provide more accurate runtime estimates. The recent addition of the so-called “soft walltime” parameter to the widely used PBS Professional enables a system administrator to actually use some of these techniques to refine user-provided walltime estimates. The obvious question for a system administrator is whether such walltime predictions are useful and “safe”, and what their impact on the overall system performance will be. In this work, we use several detailed simulations to analyze the actual impact of using soft walltimes in a job scheduler, and we discuss the scenarios in which such “refined” estimates can be meaningfully used.

Notes

1. If a given user has no completed jobs so far, such historical information is missing, so we use the user-provided estimate instead.
2. The system is configured to kill a job if it exceeds the user’s walltime estimate.
3. Unlike in the static scenario, user-oriented analysis makes good sense when the workload is dynamically adapted.
4. For example, a result of 25% means that the original wait time was decreased by 25%. Conversely, a result of −300% means that the original wait time was increased by 300%, i.e., it became four times as long.
5. The actual average wait times of the baseline solution were as follows: CERIT-SC_2015 (6.5 h), CERIT-SC_2017 (4.1 h), MetaCentrum_2013 (3.0 h), CERIT-SC_2013 (6.0 h), HPC2N (4.2 h), KTH SP2 (1.8 h), CTC SP2 (3.8 h) and SDSC SP2 (4.9 h). The average slowdowns of the baseline solution were as follows: CERIT-SC_2015 (249.9), CERIT-SC_2017 (127.9), MetaCentrum_2013 (115.5), CERIT-SC_2013 (620.7), HPC2N (143.2), KTH SP2 (105.8), CTC SP2 (49.9) and SDSC SP2 (72.8).
6. With only a few estimates used throughout the whole workload, backfilling has a significantly reduced opportunity to fill these holes, because “most jobs look the same” and thus do not fit within the available holes.
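
The percentage convention described in note 4 can be illustrated with a short sketch. The formula below is an assumption inferred from the wording of that note (a positive value is a decrease relative to the baseline, a negative value is an increase); the function name and the sample wait times are hypothetical and are not taken from the paper's results.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative change of a metric against its baseline, in percent.

    Assumed convention (inferred from note 4): positive means the metric
    (e.g., average wait time) decreased, negative means it increased.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline * 100.0


# Hypothetical baseline average wait time of 4.0 hours.
baseline_wait_h = 4.0

# The wait time drops to 3.0 h: a decrease of 25%.
print(relative_improvement(baseline_wait_h, 3.0))   # 25.0

# The wait time grows to 16.0 h, i.e., four times the baseline: -300%.
print(relative_improvement(baseline_wait_h, 16.0))  # -300.0
```

Under this reading, a result of −300% corresponds to a wait time that is four times the baseline, because the increase itself is three times the baseline value.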

Acknowledgments

We kindly acknowledge the support and computational resources provided by the MetaCentrum under the program LM2015042 and the CERIT Scientific Cloud under the program LM2015085, provided under the programme “Projects of Large Infrastructure for Research, Development, and Innovations” and the project Reg. No. CZ.02.1.01/0.0/0.0/16_013/0001797 co-funded by the Ministry of Education, Youth and Sports of the Czech Republic. We also highly appreciate the access to the workload traces provided by the Parallel Workloads Archive, MetaCentrum and CERIT-SC.

Author information

Authors and Affiliations

CESNET a.l.e., Brno, Czech Republic
Dalibor Klusáček & Václav Chlumský

Corresponding author

Correspondence to Dalibor Klusáček.

Editor information

Editors and Affiliations

CESNET, Prague, Czech Republic
Dalibor Klusáček
Google, Mountain View, CA, USA
Walfredo Cirne
Google, Seattle, WA, USA
Narayan Desai

About this paper

Cite this paper

Klusáček, D., Chlumský, V. (2019). Evaluating the Impact of Soft Walltimes on Job Scheduling Performance. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2018. Lecture Notes in Computer Science, vol 11332. Springer, Cham. https://doi.org/10.1007/978-3-030-10632-4_2

DOI: https://doi.org/10.1007/978-3-030-10632-4_2
Published: 13 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-10631-7
Online ISBN: 978-3-030-10632-4
eBook Packages: Computer Science, Computer Science (R0)
