Evaluating the Impact of Soft Walltimes on Job Scheduling Performance

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11332)

Abstract

For two decades, researchers have been analyzing the impact of inaccurate job walltime (runtime) estimates on the performance of job scheduling algorithms, especially in the case of backfilling. Several studies analyzed the pros and cons of using accurate vs. inaccurate estimates. Some researchers focused on ways to motivate users of the system to provide more accurate runtime estimates. The recent addition of the so-called “soft walltime” parameter in the widely used PBS Professional enables a system administrator to actually use some of these techniques to refine user-provided walltime estimates. The obvious question for a system administrator is whether such walltime predictions are useful and “safe”, and what their impact on overall system performance will be. In this work, we use several detailed simulations to analyze the actual impact of using soft walltimes in a job scheduler and discuss the scenarios in which such “refined” estimates can be meaningfully used.

Notes

  1. If a given user has no completed jobs so far, such historical information is missing, so we use the user-provided estimate instead.

  2. The system is configured to kill a job if it exceeds the user’s walltime estimate.

  3. Unlike in the static scenario, user-oriented analysis makes good sense when the workload is dynamically adapted.

  4. For example, a result of 25% means that the original wait time was decreased by 25%. On the other hand, a result of −300% means that the original wait time was increased by 300%, i.e., to four times its original value.

  5. The actual average wait times of the baseline solution were as follows: CERIT-SC_2015 (6.5 h), CERIT-SC_2017 (4.1 h), MetaCentrum_2013 (3.0 h), CERIT-SC_2013 (6.0 h), HPC2N (4.2 h), KTH SP2 (1.8 h), CTC SP2 (3.8 h) and SDSC SP2 (4.9 h). The average slowdowns of the baseline solution were as follows: CERIT-SC_2015 (249.9), CERIT-SC_2017 (127.9), MetaCentrum_2013 (115.5), CERIT-SC_2013 (620.7), HPC2N (143.2), KTH SP2 (105.8), CTC SP2 (49.9) and SDSC SP2 (72.8).

  6. With only a few estimates used throughout the whole workload, backfilling has a significantly decreased opportunity to fill these holes, because “most jobs look the same” and thus do not fit within the available holes.
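The percentage convention described in note 4 can be illustrated with a minimal sketch. The function name `relative_change_percent` and the sample wait times below are hypothetical and chosen only for illustration; they are not taken from the paper's experiments. The sketch assumes the result is computed relative to the baseline value, so that a positive number means the metric (e.g., wait time) decreased and a negative number means it increased.

```python
def relative_change_percent(baseline: float, new: float) -> float:
    """Return the relative change of a metric against its baseline, in percent.

    Positive result: the metric decreased (improvement for wait time).
    Negative result: the metric increased (degradation).
    Assumed convention, following the examples in note 4.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline * 100.0


# Hypothetical values: a 4.0 h baseline wait time shortened to 3.0 h
# is a 25% decrease.
improved = relative_change_percent(4.0, 3.0)

# Hypothetical values: the same baseline growing to 16.0 h is four times
# the original value, i.e., an increase by 300%, reported as -300%.
degraded = relative_change_percent(4.0, 16.0)

print(improved, degraded)
```

Under this convention the result can never exceed 100% (a wait time of zero), while degradations are unbounded, which is why large negative values such as −300% can appear next to modest positive ones.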

Acknowledgments

We kindly acknowledge the support and computational resources provided by the MetaCentrum under the program LM2015042 and the CERIT Scientific Cloud under the program LM2015085, provided under the programme “Projects of Large Infrastructure for Research, Development, and Innovations” and the project Reg. No. CZ.02.1.01/0.0/0.0/16_013/0001797 co-funded by the Ministry of Education, Youth and Sports of the Czech Republic. We also highly appreciate the access to the workload traces provided by the Parallel Workloads Archive, MetaCentrum and CERIT-SC.

Author information

Corresponding author

Correspondence to Dalibor Klusáček.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Klusáček, D., Chlumský, V. (2019). Evaluating the Impact of Soft Walltimes on Job Scheduling Performance. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2018. Lecture Notes in Computer Science, vol 11332. Springer, Cham. https://doi.org/10.1007/978-3-030-10632-4_2

  • DOI: https://doi.org/10.1007/978-3-030-10632-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-10631-7

  • Online ISBN: 978-3-030-10632-4

  • eBook Packages: Computer Science (R0)
