Abstract
A characteristic common to major physics experiments is an ever-increasing need for computing resources to process experimental data and to generate simulated data. The IN2P3 Computing Center provides its 2,500 users with about 35,000 cores and processes millions of jobs every month. This workload consists mostly of sequential jobs that correspond to Monte Carlo simulations and related analyses of data produced by the Large Hadron Collider at CERN.
To schedule such a workload under specific constraints, the CC-IN2P3 relied for 20 years on an in-house job and resource management system, complemented by an operation team that could directly act on, and modify, the decisions made by the job scheduler. This system was replaced in 2011, but legacy rules of thumb remained. Combined with other rules motivated by production constraints, they may work against the job scheduler's optimizations and force the operators to apply more corrective actions than they should.
In this experience report from a production system, we describe the decisions made since the end of 2016 to either transfer some of the operators' actions to the job scheduler or make these actions unnecessary. The physical partitioning of resources into distinct pools has been replaced by a logical partitioning that leverages scheduling queues. Some historical constraints, such as quotas, have then been relaxed. For instance, the number of concurrent jobs from a given user group allowed to access a specific resource, e.g., a storage subsystem, has been progressively increased. Finally, the computation of the fair-share by the job scheduler has been modified to be less detrimental to small groups whose jobs have a low priority. The preliminary but promising results of these modifications mark the beginning of a long-term effort to change the operation procedures applied to the computing infrastructure of the IN2P3 Computing Center.
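The fair-share idea mentioned above can be illustrated with a minimal sketch: a group's dynamic priority rises when its recent resource usage falls below its target share, which naturally boosts small groups with little accumulated usage. This is only an illustrative model in the spirit of classic fair-share scheduling (Kay and Lauder, 1988); the function, group names, and numbers below are hypothetical, not the Grid Engine configuration actually used at CC-IN2P3.

```python
def fair_share_priority(target_share, recent_usage, total_usage, epsilon=1e-9):
    """Illustrative fair-share priority: higher when a group is under-served.

    target_share: fraction of the cluster the group is entitled to.
    recent_usage: the group's recent (e.g., decayed) CPU consumption.
    total_usage:  recent consumption summed over all groups.
    """
    observed_share = recent_usage / (total_usage + epsilon)
    # A ratio above 1 means the group consumed less than its entitlement,
    # so its pending jobs should be prioritized.
    return target_share / (observed_share + epsilon)


# Hypothetical groups: (target share, recent CPU-seconds consumed).
groups = {
    "atlas": (0.40, 3_500_000),
    "cms":   (0.35, 3_100_000),
    "small": (0.05, 10_000),   # small group with little recent usage
}
total = sum(usage for _, usage in groups.values())
priorities = {
    name: fair_share_priority(share, usage, total)
    for name, (share, usage) in groups.items()
}
```

Under this model, the small group ends up with a far higher priority than the large collaborations despite its tiny target share, which captures why a naive fair-share formula can, conversely, be tuned to avoid penalizing small groups.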
Acknowledgements
The authors would like to thank the members of the Operation and Applications teams of the CC-IN2P3 for their help in the preparation of this experience report.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Azevedo, F., Gombert, L., Suter, F. (2019). Reducing the Human-in-the-Loop Component of the Scheduling of Large HTC Workloads. In: Klusáček, D., Cirne, W., Desai, N. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2018. Lecture Notes in Computer Science(), vol 11332. Springer, Cham. https://doi.org/10.1007/978-3-030-10632-4_3
Print ISBN: 978-3-030-10631-7
Online ISBN: 978-3-030-10632-4