Analysis of Job Metadata for Enhanced Wall Time Prediction

  • Mehmet Soysal
  • Marco Berghoff
  • Achim Streit
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11332)

Abstract

For efficient utilization of large-scale HPC systems, resource management and job scheduling are of the highest priority. Modern job scheduling systems therefore require information about the expected total wall time of a job already at submission time, as accurate wall time estimates are key to reliable scheduling decisions. Typically, users specify these estimates at submission time, based either on previous experience or on limits imposed by the system. Real-world experience shows that user-given estimates are far from accurate. Hence, an automated system that produces more precise wall time estimates for submitted jobs is desirable. In this paper, we investigate different kinds of job metadata and their impact on wall time prediction. We predict job wall times using machine learning methods applied to the workload traces of large HPC systems. In contrast to previous work, we also consider the job name and, in particular, the submission directory. Our evaluation shows that, without any in-depth analysis of the jobs, our per-user wall time predictions are a factor of seven more accurate than the estimates of most users.
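The idea of predicting a job's wall time from metadata shared with earlier jobs can be illustrated with a minimal sketch. This is not the paper's actual method (the authors use machine learning on workload traces, e.g. with scikit-learn); it is a toy historical predictor in the spirit of the cited template- and profile-based approaches, and all names (`WallTimePredictor`, the sample users, job names, and directories) are hypothetical. It groups past runtimes by user, job name, and submission directory, and predicts the mean of the most specific matching history:

```python
from collections import defaultdict
from statistics import mean


class WallTimePredictor:
    """Toy historical wall time predictor: estimate a new job's wall time
    as the mean runtime of earlier jobs sharing the same metadata,
    preferring the most specific match (user + jobname + submit dir)
    and falling back to coarser groupings."""

    def __init__(self):
        # Histories at three levels of specificity.
        self.by_full = defaultdict(list)   # key: (user, jobname, subdir)
        self.by_name = defaultdict(list)   # key: (user, jobname)
        self.by_user = defaultdict(list)   # key: user

    def record(self, user, jobname, subdir, walltime):
        """Store the observed runtime of a completed job under all keys."""
        self.by_full[(user, jobname, subdir)].append(walltime)
        self.by_name[(user, jobname)].append(walltime)
        self.by_user[user].append(walltime)

    def predict(self, user, jobname, subdir, default=3600.0):
        """Return the mean runtime of the most specific non-empty history,
        or a fixed default if the user has no history at all."""
        for hist in (self.by_full.get((user, jobname, subdir)),
                     self.by_name.get((user, jobname)),
                     self.by_user.get(user)):
            if hist:
                return mean(hist)
        return default


p = WallTimePredictor()
p.record("alice", "cfd_run", "/home/alice/case1", 1200)
p.record("alice", "cfd_run", "/home/alice/case1", 1400)
p.record("alice", "postproc", "/home/alice/case1", 300)

est = p.predict("alice", "cfd_run", "/home/alice/case1")  # mean of case1 history
est_new_dir = p.predict("alice", "cfd_run", "/home/alice/case2")  # falls back to (user, jobname)
```

A real system would replace the mean with a trained regressor and add features beyond exact-match metadata, but the fallback hierarchy shows why job name and submission directory carry predictive signal: jobs launched from the same directory under the same name tend to be repetitions of the same workload.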

Notes

Acknowledgement

This work, carried out within the ADA-FS project, is funded by the DFG Priority Programme “Software for Exascale Computing” (SPPEXA, SPP 1648), which is gratefully acknowledged.

References

  1. Hovestadt, M., Kao, O., Keller, A., Streit, A.: Scheduling in HPC resource management systems: queuing vs. planning. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 1–20. Springer, Heidelberg (2003). https://doi.org/10.1007/10968987_1
  2. Oeste, S., Kluge, M., Soysal, M., Streit, A., Vef, M., Brinkmann, A.: Exploring opportunities for job-temporal file systems with ADA-FS. In: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (2016)
  3. Gibbons, R.: A historical application profiler for use by parallel schedulers. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1997. LNCS, vol. 1291, pp. 58–77. Springer, Heidelberg (1997). https://doi.org/10.1007/3-540-63574-2_16
  4. Downey, A.B.: Predicting queue times on space-sharing parallel computers. In: Proceedings of the 11th International Parallel Processing Symposium, pp. 209–218. IEEE (1997)
  5. Gibbons, R.: A historical profiler for use by parallel schedulers. Master’s thesis, University of Toronto (1997)
  6. Smith, W., Foster, I., Taylor, V.: Predicting application run times using historical information. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1998. LNCS, vol. 1459, pp. 122–142. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0053984
  7. Smith, W., Taylor, V., Foster, I.: Using run-time predictions to estimate queue wait times and improve scheduler performance. In: Feitelson, D.G., Rudolph, L. (eds.) JSSPP 1999. LNCS, vol. 1659, pp. 202–219. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-47954-6_11
  8. Matsunaga, A., Fortes, J.A.B.: On the use of machine learning to predict the time and resources consumed by applications. In: Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp. 495–504. IEEE Computer Society (2010)
  9. Kapadia, N.H., Fortes, J.A.B.: On the design of a demand-based network-computing system: the Purdue University Network-Computing Hubs. In: Proceedings of the Seventh International Symposium on High Performance Distributed Computing, pp. 71–80. IEEE (1998)
  10. Mu’alem, A.W., Feitelson, D.G.: Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel Distrib. Syst. 12(6), 529–543 (2001)
  11. Nadeem, F., Fahringer, T.: Using templates to predict execution time of scientific workflow applications in the grid. In: Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pp. 316–323. IEEE Computer Society (2009)
  12. Smith, W.: Prediction services for distributed computing. In: IEEE International Parallel and Distributed Processing Symposium, IPDPS 2007, pp. 1–10. IEEE (2007)
  13. Tsafrir, D., Etsion, Y., Feitelson, D.G.: Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans. Parallel Distrib. Syst. 18(6), 789–803 (2007)
  14.
  15. Karnak start/wait time predictions. http://karnak.xsede.org/karnak/index.html
  16. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. MIT Press, Cambridge (2012)
  17. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 2962–2970. Curran Associates Inc., New York (2015)
  18. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
  19. Buitinck, L., et al.: API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108–122 (2013)
  20.
  21.
  22.
  23.
  24. Feitelson, D.G., Tsafrir, D., Krakov, D.: Experience with using the parallel workloads archive. J. Parallel Distrib. Comput. 74(10), 2967–2982 (2014)
  25.
  26.
  27.
  28.
  29. scikit-learn: dataset splitting
  30.
  31. Bellman, R.: Dynamic Programming. Courier Corporation, North Chelmsford (2013)
  32. Hughes, G.: On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 14(1), 55–63 (1968)
  33. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2(11), 559–572 (1901)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology (KIT), Eggenstein-Leopoldshafen, Germany