Plan-Based Job Scheduling for Supercomputers with Shared Burst Buffers

Kopanski, Jan; Rzadca, Krzysztof

doi:10.1007/978-3-030-85665-6_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12820)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

The ever-increasing gap between compute and I/O performance in HPC platforms, together with the development of novel NVMe storage devices (NVRAM), led to the emergence of the burst buffer concept—an intermediate persistent storage layer logically positioned between random-access main memory and a parallel file system. Despite the development of real-world architectures as well as research concepts, resource and job management systems, such as Slurm, provide only marginal support for scheduling jobs with burst buffer requirements, in particular ignoring burst buffers when backfilling. We investigate the impact of burst buffer reservations on the overall efficiency of online job scheduling for common algorithms: First-Come-First-Served (FCFS) and Shortest-Job-First (SJF) EASY-backfilling. We evaluate the algorithms in a detailed simulation with I/O side effects. Our results indicate that the lack of burst buffer reservations in backfilling may significantly deteriorate scheduling. We also show that these algorithms can be easily extended to support burst buffers. Finally, we propose a burst-buffer–aware plan-based scheduling algorithm with simulated annealing optimisation, which improves the mean waiting time by over 20% and mean bounded slowdown by 27% compared to the burst-buffer–aware SJF-EASY-backfilling.
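
The abstract's point about backfilling that ignores burst buffers can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single shared burst buffer pool measured in arbitrary capacity units, and the names `Job`, `Running`, `head_reservation`, and `can_backfill` are hypothetical. The head job of the queue receives a reservation at the earliest time when both enough nodes and enough burst buffer capacity are free, and a candidate may be backfilled only if it does not delay that reservation in either dimension.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int       # requested compute nodes
    bb: float        # requested burst buffer capacity (arbitrary units)
    walltime: float  # user-provided runtime estimate


@dataclass
class Running:
    job: Job
    end: float       # expected completion time


def head_reservation(head, running, free_nodes, free_bb, now):
    """Earliest time at which the head job fits in BOTH nodes and burst buffer."""
    nodes, bb = free_nodes, free_bb
    if head.nodes <= nodes and head.bb <= bb:
        return now
    # Release resources in order of expected completion until the head job fits.
    for r in sorted(running, key=lambda r: r.end):
        nodes += r.job.nodes
        bb += r.job.bb
        if head.nodes <= nodes and head.bb <= bb:
            return r.end
    raise ValueError("head job exceeds total platform capacity")


def can_backfill(cand, head, running, free_nodes, free_bb, now):
    """True if starting `cand` now does not delay the head job's reservation."""
    if cand.nodes > free_nodes or cand.bb > free_bb:
        return False  # candidate does not fit right now
    start = head_reservation(head, running, free_nodes, free_bb, now)
    if now + cand.walltime <= start:
        return True  # candidate finishes before the reservation begins
    # Candidate is still running at `start`: the head job must fit next to it.
    released = [r.job for r in running if r.end <= start]
    nodes_left = free_nodes + sum(j.nodes for j in released) - cand.nodes
    bb_left = free_bb + sum(j.bb for j in released) - cand.bb
    return head.nodes <= nodes_left and head.bb <= bb_left
```

In this sketch, a long candidate that leaves enough nodes free but holds too much burst buffer capacity is rejected, whereas a check on nodes alone would admit it and delay the head job.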

References

1. Ben-Ameur, W.: Computing the initial temperature of simulated annealing. Comput. Optim. Appl. 29, 369–385 (2004). https://doi.org/10.1023/B:COAP.0000044187.23143.bd
2. Bhimji, W., et al.: Accelerating science with the NERSC burst buffer early user program (2016)
3. Carastan-Santos, D., De Camargo, R.Y., Trystram, D., Zrigui, S.: One can only gain by replacing EASY Backfilling: a simple scheduling policies case study. In: CCGrid. IEEE (2019)
4. Casanova, H., Giersch, A., Legrand, A., Quinson, M., Suter, F.: Versatile, scalable, and accurate simulation of distributed applications and platforms. J. Parallel Distrib. Comput. 74, 2899–2917 (2014)
5. Dongarra, J.: Report on the Fujitsu Fugaku system. Tech. Rep. ICL-UT-20-06, University of Tennessee, June 2020
6. Dutot, P.F., Mercier, M., Poquet, M., Richard, O.: Batsim: a realistic language-independent resources and jobs management systems simulator. In: JSSPP Workshop (2016)
7. Fan, Y., et al.: Scheduling beyond CPUs for HPC. In: HPDC 2019. ACM (2019)
8. Feitelson, D.G.: Experimental analysis of the root causes of performance evaluation results: a backfilling case study. TPDS 16, 175–182 (2005)
9. Feitelson, D.G., Tsafrir, D., Krakov, D.: Experience with using the parallel workloads archive. J. Parallel Distrib. Comput. 74, 2967–2982 (2014)
10. Gainaru, A., Aupy, G., Benoit, A., Cappello, F., Robert, Y., Snir, M.: Scheduling the I/O of HPC applications under congestion. In: IPDPS, Proceedings IEEE (2015)
11. Harms, K., Oral, H.S., Atchley, S., Vazhkudai, S.S.: Impact of burst buffer architectures on application portability (2016)
12. Hemmert, K.S., et al.: Trinity: architecture and early experience (2016)
13. Herbein, S., et al.: Scalable I/O-aware job scheduling for burst buffer enabled HPC clusters. In: HPDC 2016. ACM (2016)
14. Hofmann, H., Wickham, H., Kafadar, K.: Letter-value plots: boxplots for large data. J. Comput. Graph. Stat. 26, 469–477 (2017)
15. Hovestadt, M., Kao, O., Keller, A., Streit, A.: Scheduling in HPC resource management systems: queuing vs. planning. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 1–20. Springer, Heidelberg (2003). https://doi.org/10.1007/10968987_1
16. Isakov, M., et al.: HPC I/O throughput bottleneck analysis with explainable local models. In: SC20. IEEE (2020)
17. Klusáček, D., Tóth, Š., Podolníková, G.: Real-life experience with major reconfiguration of job scheduling system. In: Desai, N., Cirne, W. (eds.) JSSPP, pp. 83–101. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-61756-5_5
18. Kopanski, J., Rzadca, K.: Artifact and instructions to generate experimental results for the Euro-Par 2021 paper: plan-based job scheduling for supercomputers with shared burst buffers, August 2021. https://doi.org/10.6084/m9.figshare.14754507
19. Lackner, L.E., Fard, H.M., Wolf, F.: Efficient job scheduling for clusters with shared tiered storage. In: CCGRID, IEEE/ACM, pp. 321–330 (2019)
20. Liu, N., et al.: On the role of burst buffers in leadership-class storage systems. In: MSST, Proceedings IEEE (2012)
21. Poquet, M.: Simulation approach for resource management. Theses, Université Grenoble Alpes, December 2017
22. RIKEN Center for Computational Science: Post-K (Fugaku) information (2020). https://postk-web.r-ccs.riken.jp/spec.html. Accessed 04 Aug 2020
23. Srinivasan, S., Kettimuthu, R., Subramani, V., Sadayappan, P.: Selective reservation strategies for backfill job scheduling. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2002. LNCS, vol. 2537, pp. 55–71. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36180-4_4
24. Vazhkudai, S.S., et al.: The design, deployment, and evaluation of the CORAL pre-exascale systems. In: SC18, Proceedings IEEE (2018)
25. Yoo, A.B., Jette, M.A., Grondona, M.: SLURM: simple Linux utility for resource management. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 44–60. Springer, Heidelberg (2003). https://doi.org/10.1007/10968987_3
26. Zheng, X., Zhou, Z., Yang, X., Lan, Z., Wang, J.: Exploring plan-based scheduling for large-scale computing systems. In: CLUSTER, Proceedings IEEE (2016)
27. Zhou, Z., et al.: I/O-aware batch scheduling for petascale computing systems. In: CLUSTER. IEEE (2015)

Acknowledgements

This research is supported by a Polish National Science Center grant Opus (UMO-2017/25/B/ST6/00116).

The MetaCentrum workload log [17] was graciously provided by Czech National Grid Infrastructure MetaCentrum. The workload log from the KTH SP2 was graciously provided by Lars Malinowsky.

Author information

Authors and Affiliations

Institute of Informatics, University of Warsaw, Stefana Banacha 2, 02-097, Warsaw, Poland
Jan Kopanski & Krzysztof Rzadca

Corresponding authors

Correspondence to Jan Kopanski or Krzysztof Rzadca.

Editor information

Editors and Affiliations

Universidade de Lisboa, Lisbon, Portugal
Leonel Sousa
Universidade de Lisboa, Lisbon, Portugal
Nuno Roma
Universidade de Lisboa, Lisbon, Portugal
Pedro Tomás

Cite this paper

Kopanski, J., Rzadca, K. (2021). Plan-Based Job Scheduling for Supercomputers with Shared Burst Buffers. In: Sousa, L., Roma, N., Tomás, P. (eds) Euro-Par 2021: Parallel Processing. Euro-Par 2021. Lecture Notes in Computer Science, vol 12820. Springer, Cham. https://doi.org/10.1007/978-3-030-85665-6_8

DOI: https://doi.org/10.1007/978-3-030-85665-6_8
Published: 25 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85664-9
Online ISBN: 978-3-030-85665-6
eBook Packages: Computer Science, Computer Science (R0)
