Analysis of Job Metadata for Enhanced Wall Time Prediction

Soysal, Mehmet; Berghoff, Marco; Streit, Achim

doi:10.1007/978-3-030-10632-4_1

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11332)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

Efficient utilization of large-scale HPC systems depends critically on resource management and job scheduling. Modern job schedulers therefore require an estimate of each job's total wall time at submission time, and accurate estimates are key to reliable scheduling decisions. Typically, users supply these estimates themselves, based on either prior experience or limits imposed by the system. Real-world experience shows that user-provided estimates are far from accurate. Hence, an automated system that produces more precise wall time estimates for submitted jobs is desirable. In this paper, we investigate different job metadata and their impact on wall time prediction. For the prediction, we use machine learning methods and workload traces of large HPC systems. In contrast to previous work, we also consider the job name and, in particular, the submission directory. Our evaluation shows that, per user, our predictions are more accurate than most users' own estimates by a factor of seven, without any in-depth analysis of the job.
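To illustrate why metadata such as the job name and submission directory can carry predictive signal, the following minimal sketch groups past wall times by (user, job name, submission directory) and predicts the median of each group. This is not the machine learning approach evaluated in the paper; it is a deliberately simple lookup baseline, and every record, user name, path, and function name in it is hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical job records:
# (user, job_name, submit_dir, requested_seconds, actual_seconds)
HISTORY = [
    ("alice", "cfd_run", "/home/alice/cfd", 86400, 3600),
    ("alice", "cfd_run", "/home/alice/cfd", 86400, 4200),
    ("alice", "cfd_run", "/home/alice/cfd", 86400, 3900),
    ("bob", "md_sim", "/work/bob/md", 172800, 50000),
]


def build_predictor(history):
    """Map each (user, job name, submission directory) to its median past wall time."""
    by_key = defaultdict(list)
    for user, name, submit_dir, _requested, actual in history:
        by_key[(user, name, submit_dir)].append(actual)
    return {key: median(times) for key, times in by_key.items()}


def predict(predictor, user, name, submit_dir, requested):
    """Return the median past wall time for matching metadata.

    Falls back to the user's own request when no matching history exists.
    """
    return predictor.get((user, name, submit_dir), requested)


predictor = build_predictor(HISTORY)

# A job with matching history gets a data-driven estimate ...
estimate = predict(predictor, "alice", "cfd_run", "/home/alice/cfd", 86400)
# ... while an unseen combination keeps the user-supplied value.
fallback = predict(predictor, "carol", "new_job", "/home/carol", 7200)

print(estimate, fallback)
```

In this toy data the user requests a full day for a job that has repeatedly finished in about an hour, which is the kind of gap between user estimates and actual wall time that the abstract describes.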

Acknowledgement

This work, part of the ADA-FS project, is funded by the DFG Priority Program “Software for Exascale Computing” (SPPEXA, SPP 1648), which is gratefully acknowledged.

Author information

Authors and Affiliations

Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology (KIT), Hermann-von-Helmholtz-Platz 1, 76344, Eggenstein-Leopoldshafen, Germany
Mehmet Soysal, Marco Berghoff & Achim Streit

Corresponding author

Correspondence to Mehmet Soysal.

Editor information

Editors and Affiliations

CESNET, Prague, Czech Republic
Dalibor Klusáček
Google, Mountain View, CA, USA
Walfredo Cirne
Google, Seattle, WA, USA
Narayan Desai

About this paper

Cite this paper

Soysal, M., Berghoff, M., Streit, A. (2019). Analysis of Job Metadata for Enhanced Wall Time Prediction. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2018. Lecture Notes in Computer Science, vol 11332. Springer, Cham. https://doi.org/10.1007/978-3-030-10632-4_1

DOI: https://doi.org/10.1007/978-3-030-10632-4_1
Published: 13 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-10631-7
Online ISBN: 978-3-030-10632-4
eBook Packages: Computer Science, Computer Science (R0)