AccaSim: a customizable workload management simulator for job dispatching research in HPC systems

  • Cristian Galleguillos
  • Zeynep Kiziltan
  • Alessio Netti
  • Ricardo Soto
Article

Abstract

We present AccaSim, a simulator for workload management in HPC systems. Thanks to AccaSim’s scalability to large workload datasets, support for easy customization, and practical automated tools to aid experimentation, users can easily represent various real HPC systems, develop novel advanced dispatchers and evaluate them in a convenient way across different workload sources. AccaSim is thus an attractive tool for conducting job dispatching research in HPC systems.

Keywords

HPC systems, Workload management system, Job dispatching problem, Simulation tool, Dispatcher development, Dispatcher evaluation

Notes

Acknowledgements

C. Galleguillos is supported by Postgraduate Grant PUCV 2018. A. Netti is supported by a research fellowship from the Oprecomp-Open Transprecision Computing project. R. Soto is supported by Grant CONICYT/FONDECYT/REGULAR/1160455. We are grateful to Åke Sandgren, Motoyoshi Kurokawa, and the Czech National Grid Infrastructure MetaCentrum for providing, respectively, the Seth, RICC, and MetaCentrum workload datasets. We thank Alina Sîrbu for fruitful discussions on the work presented here. Finally, we appreciate the valuable comments of the reviewers, which helped improve the paper significantly. We especially thank Millian Poquet for signing his review and giving us the opportunity to interact during the revision of the paper.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
  2. University of Bologna, Bologna, Italy
