Analysis of Job Metadata for Enhanced Wall Time Prediction

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11332)

Abstract

Efficient utilization of large-scale HPC systems depends on resource management and job scheduling. Modern job scheduling systems therefore require an estimate of each job's total wall time at submission time, and accurate wall time estimates are key to reliable scheduling decisions. Typically, users specify these estimates themselves, based on either previous knowledge or limits imposed by the system. Real-world experience shows that user-given estimates are far from accurate. Hence, an automated system that creates more precise wall time estimates for submitted jobs is desirable. In this paper, we investigate different job metadata and their impact on wall time prediction. For the prediction, we use machine learning methods and the workload traces of large HPC systems. In contrast to previous work, we also consider the job name and, in particular, the submission directory. Our evaluation shows that, without any in-depth analysis of the job, we can predict wall times per user more accurately than most users do, by a factor of seven.
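
To make the idea of predicting wall time from submission metadata concrete, the following is a minimal sketch, not the method evaluated in the paper. It swaps the machine learning models for a plain historical-median baseline keyed on user, job name, and submission directory. The function names `train` and `predict`, the record fields, and the sample trace are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import median


def train(jobs):
    """Collect observed wall times (seconds) per metadata key.

    Each job is a dict with the hypothetical fields
    'user', 'jobname', 'submit_dir', and 'wall_time'.
    """
    history = defaultdict(list)
    for job in jobs:
        # Most specific key: user, job name, and submission directory.
        history[(job["user"], job["jobname"], job["submit_dir"])].append(job["wall_time"])
        # Coarser key used as a fallback: the user alone.
        history[(job["user"],)].append(job["wall_time"])
    return history


def predict(history, job, fallback):
    """Return the median wall time of the most specific matching history.

    If neither the full key nor the user has been seen before, return
    `fallback` (for example, the wall time the user requested).
    """
    keys = (
        (job["user"], job["jobname"], job["submit_dir"]),
        (job["user"],),
    )
    for key in keys:
        if history.get(key):
            return median(history[key])
    return fallback


# Hypothetical trace, invented for illustration only.
trace = [
    {"user": "alice", "jobname": "cfd_run", "submit_dir": "/home/alice/cfd", "wall_time": 3600},
    {"user": "alice", "jobname": "cfd_run", "submit_dir": "/home/alice/cfd", "wall_time": 3800},
    {"user": "alice", "jobname": "cfd_run", "submit_dir": "/home/alice/cfd", "wall_time": 3700},
    {"user": "alice", "jobname": "post", "submit_dir": "/home/alice/post", "wall_time": 120},
    {"user": "bob", "jobname": "train", "submit_dir": "/work/bob/ml", "wall_time": 7200},
]

history = train(trace)
known = {"user": "alice", "jobname": "cfd_run", "submit_dir": "/home/alice/cfd"}
new_job = {"user": "alice", "jobname": "new", "submit_dir": "/tmp"}
new_user = {"user": "carol", "jobname": "x", "submit_dir": "/tmp"}

print(predict(history, known, fallback=86400))     # matches on the full key
print(predict(history, new_job, fallback=86400))   # falls back to the user's history
print(predict(history, new_user, fallback=86400))  # falls back to the requested time
```

A baseline of this kind only illustrates why the job name and submission directory can be informative: jobs that a user repeatedly submits under the same name from the same directory may well have similar run times. A learned model would replace the lookup with a regressor trained on the same fields.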

Acknowledgement

This work, carried out within the ADA-FS project, is funded by the DFG Priority Program “Software for Exascale Computing” (SPPEXA, SPP 1648), which is gratefully acknowledged.

Author information

Corresponding author

Correspondence to Mehmet Soysal.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Soysal, M., Berghoff, M., Streit, A. (2019). Analysis of Job Metadata for Enhanced Wall Time Prediction. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2018. Lecture Notes in Computer Science, vol 11332. Springer, Cham. https://doi.org/10.1007/978-3-030-10632-4_1

  • DOI: https://doi.org/10.1007/978-3-030-10632-4_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-10631-7

  • Online ISBN: 978-3-030-10632-4

  • eBook Packages: Computer Science, Computer Science (R0)
