Plan-Based Job Scheduling for Supercomputers with Shared Burst Buffers

  • Conference paper
Euro-Par 2021: Parallel Processing (Euro-Par 2021)

Abstract

The ever-increasing gap between compute and I/O performance in HPC platforms, together with the development of novel NVMe storage devices (NVRAM), led to the emergence of the burst buffer concept—an intermediate persistent storage layer logically positioned between random-access main memory and a parallel file system. Despite the development of real-world architectures as well as research concepts, resource and job management systems, such as Slurm, provide only marginal support for scheduling jobs with burst buffer requirements, in particular ignoring burst buffers when backfilling. We investigate the impact of burst buffer reservations on the overall efficiency of online job scheduling for common algorithms: First-Come-First-Served (FCFS) and Shortest-Job-First (SJF) EASY-backfilling. We evaluate the algorithms in a detailed simulation with I/O side effects. Our results indicate that the lack of burst buffer reservations in backfilling may significantly deteriorate scheduling. We also show that these algorithms can be easily extended to support burst buffers. Finally, we propose a burst-buffer–aware plan-based scheduling algorithm with simulated annealing optimisation, which improves the mean waiting time by over 20% and mean bounded slowdown by 27% compared to the burst-buffer–aware SJF-EASY-backfilling.
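The claim that EASY-backfilling can be extended to support burst buffers can be illustrated with a minimal sketch. It is not the authors' implementation: the `Job` fields, the function names `head_reservation` and `can_backfill`, and the units are hypothetical, and the sketch assumes a single shared burst buffer pool and that running jobs end at their estimated end times. The idea is to compute the reservation of the job at the head of the queue over both processors and burst buffer capacity, and to admit a backfill candidate only if it cannot delay that reservation in either resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    procs: int     # requested processors
    bb: int        # requested burst buffer capacity (hypothetical unit: GB)
    walltime: int  # user-provided runtime estimate in seconds


def head_reservation(now, free_procs, free_bb, running, head):
    """Earliest time at which `head` fits in BOTH processors and burst buffer.

    `running` is a list of (estimated_end_time, job) pairs.
    Returns (start_time, spare_procs, spare_bb), where the spare values are
    the resources left over at start_time once `head` has been placed.
    """
    procs, bb, start = free_procs, free_bb, now
    for end, job in sorted(running, key=lambda r: r[0]):
        if procs >= head.procs and bb >= head.bb:
            break
        # Release the resources of the next job to finish.
        procs += job.procs
        bb += job.bb
        start = end
    if procs < head.procs or bb < head.bb:
        raise ValueError("head job does not fit even on an empty machine")
    return start, procs - head.procs, bb - head.bb


def can_backfill(now, free_procs, free_bb, running, head, cand):
    """True if `cand` may start now without delaying the reservation of `head`."""
    # The candidate must fit in the currently free resources.
    if cand.procs > free_procs or cand.bb > free_bb:
        return False
    start, spare_procs, spare_bb = head_reservation(
        now, free_procs, free_bb, running, head
    )
    # Either it finishes before the reservation starts ...
    if now + cand.walltime <= start:
        return True
    # ... or it only uses resources that the head job leaves spare.
    return cand.procs <= spare_procs and cand.bb <= spare_bb


running = [(100, Job("r", procs=4, bb=40, walltime=100))]
head = Job("h", procs=4, bb=80, walltime=300)
long_big = Job("c", procs=2, bb=50, walltime=500)
print(can_backfill(0, 4, 60, running, head, long_big))  # False
```

In this example the head job already fits in the free processors but must wait until time 100 for burst buffer capacity. The candidate would leave only 50 GB for the head job at that time, so it is rejected. A check that looked at processors alone would admit it (2 processors against 4 spare) and thereby delay the head job, which is the kind of effect the abstract attributes to ignoring burst buffers when backfilling.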

References

  1. Ben-Ameur, W.: Computing the initial temperature of simulated annealing. Comput. Optim. Appl. 29, 369–385 (2004). https://doi.org/10.1023/B:COAP.0000044187.23143.bd

  2. Bhimji, W., et al.: Accelerating science with the NERSC burst buffer early user program (2016)

  3. Carastan-Santos, D., De Camargo, R.Y., Trystram, D., Zrigui, S.: One can only gain by replacing EASY Backfilling: a simple scheduling policies case study. In: CCGrid. IEEE (2019)

  4. Casanova, H., Giersch, A., Legrand, A., Quinson, M., Suter, F.: Versatile, scalable, and accurate simulation of distributed applications and platforms. J. Parallel Distrib. Comput. 74, 2899–2917 (2014)

  5. Dongarra, J.: Report on the Fujitsu Fugaku system. Tech. Rep. ICL-UT-20-06, University of Tennessee, June 2020

  6. Dutot, P.F., Mercier, M., Poquet, M., Richard, O.: Batsim: a realistic language-independent resources and jobs management systems simulator. In: JSSPP Workshop (2016)

  7. Fan, Y., et al.: Scheduling beyond CPUs for HPC. In: HPDC 2019. ACM (2019)

  8. Feitelson, D.G.: Experimental analysis of the root causes of performance evaluation results: a backfilling case study. TPDS 16, 175–182 (2005)

  9. Feitelson, D.G., Tsafrir, D., Krakov, D.: Experience with using the parallel workloads archive. J. Parallel Distrib. Comput. 74, 2967–2982 (2014)

  10. Gainaru, A., Aupy, G., Benoit, A., Cappello, F., Robert, Y., Snir, M.: Scheduling the I/O of HPC applications under congestion. In: IPDPS, Proceedings IEEE (2015)

  11. Harms, K., Oral, H.S., Atchley, S., Vazhkudai, S.S.: Impact of burst buffer architectures on application portability (2016)

  12. Hemmert, K.S., et al.: Trinity: architecture and early experience (2016)

  13. Herbein, S., et al.: Scalable I/O-aware job scheduling for burst buffer enabled HPC clusters. In: HPDC 2016. ACM (2016)

  14. Hofmann, H., Wickham, H., Kafadar, K.: Letter-value plots: boxplots for large data. J. Comput. Graph. Stat. 26, 469–477 (2017)

  15. Hovestadt, M., Kao, O., Keller, A., Streit, A.: Scheduling in HPC resource management systems: queuing vs. planning. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 1–20. Springer, Heidelberg (2003). https://doi.org/10.1007/10968987_1

  16. Isakov, M., et al.: HPC I/O throughput bottleneck analysis with explainable local models. In: SC20. IEEE (2020)

  17. Klusáček, D., Tóth, Š, Podolníková, G.: Real-life experience with major reconfiguration of job scheduling system. In: Desai, N., Cirne, W. (eds.) JSSPP, pp. 83–101. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-61756-5_5

  18. Kopanski, J., Rzadca, K.: Artifact and instructions to generate experimental results for the Euro-Par 2021 paper: plan-based job scheduling for supercomputers with shared burst buffers, August 2021. https://doi.org/10.6084/m9.figshare.14754507

  19. Lackner, L.E., Fard, H.M., Wolf, F.: Efficient job scheduling for clusters with shared tiered storage. In: CCGRID, IEEE/ACM, pp. 321–330 (2019)

  20. Liu, N., et al.: On the role of burst buffers in leadership-class storage systems. In: MSST, Proceedings IEEE (2012)

  21. Poquet, M.: Simulation approach for resource management. Theses, Université Grenoble Alpes, December 2017

  22. RIKEN Center for Computational Science: Post-K (Fugaku) information (2020). https://postk-web.r-ccs.riken.jp/spec.html. Accessed 04 Aug 2020

  23. Srinivasan, S., Kettimuthu, R., Subramani, V., Sadayappan, P.: Selective reservation strategies for backfill job scheduling. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2002. LNCS, vol. 2537, pp. 55–71. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36180-4_4

  24. Vazhkudai, S.S., et al.: The design, deployment, and evaluation of the CORAL pre-exascale systems. In: SC18, Proceedings IEEE (2018)

  25. Yoo, A.B., Jette, M.A., Grondona, M.: SLURM: simple Linux utility for resource management. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 44–60. Springer, Heidelberg (2003). https://doi.org/10.1007/10968987_3

  26. Zheng, X., Zhou, Z., Yang, X., Lan, Z., Wang, J.: Exploring plan-based scheduling for large-scale computing systems. In: CLUSTER, Proceedings IEEE (2016)

  27. Zhou, Z., et al.: I/O-aware batch scheduling for petascale computing systems. In: CLUSTER. IEEE (2015)

Acknowledgements

This research is supported by a Polish National Science Center grant Opus (UMO-2017/25/B/ST6/00116).

The MetaCentrum workload log [17] was graciously provided by Czech National Grid Infrastructure MetaCentrum. The workload log from the KTH SP2 was graciously provided by Lars Malinowsky.

Author information

Corresponding authors

Correspondence to Jan Kopanski or Krzysztof Rzadca.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kopanski, J., Rzadca, K. (2021). Plan-Based Job Scheduling for Supercomputers with Shared Burst Buffers. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_8

  • DOI: https://doi.org/10.1007/978-3-030-85665-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85664-9

  • Online ISBN: 978-3-030-85665-6

  • eBook Packages: Computer Science, Computer Science (R0)
