Using Pilot Systems to Execute Many Task Workloads on Supercomputers

  • Andre Merzky
  • Matteo Turilli
  • Manuel Maldonado
  • Mark Santcroos
  • Shantenu Jha
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11332)

Abstract

High-performance computing systems have historically been designed to support applications composed mostly of monolithic, single-job workloads. Pilot systems decouple workload specification, resource selection, and task execution via job placeholders and late binding, which helps to satisfy the resource requirements of workloads composed of multiple tasks. RADICAL-Pilot (RP) is a modular and extensible Python-based pilot system. In this paper we describe RP’s design, architecture, and implementation, and characterize its performance. RP is capable of spawning more than 100 tasks/second and supports the steady-state execution of up to 16K concurrent tasks. RP can be used stand-alone, or integrated with other application-level tools as a runtime system.
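The placeholder and late-binding idea can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not RP's API or scheduler: the `Task` type, the `simulate_pilot` function, the FIFO binding policy, and the discrete time steps are all hypothetical and chosen only for illustration. The pilot stands for a placeholder job that already holds a fixed number of cores; tasks are bound to those cores only when enough of them are free, rather than being assigned in advance.

```python
from collections import deque
from typing import NamedTuple


class Task(NamedTuple):
    """Hypothetical task description (not an RP type)."""
    name: str
    cores: int     # cores the task needs while it runs
    duration: int  # run time in simulation steps (>= 1)


def simulate_pilot(pilot_cores, tasks):
    """Run `tasks` inside a placeholder job that holds `pilot_cores` cores.

    Tasks are bound to cores in FIFO order, and only once enough cores
    are free (late binding). Returns (start step per task, makespan).
    """
    for task in tasks:
        if task.cores > pilot_cores:
            raise ValueError(f"task {task.name} does not fit in the pilot")
        if task.duration < 1:
            raise ValueError(f"task {task.name} needs a duration >= 1")

    queue = deque(tasks)
    free = pilot_cores
    running = []   # pairs of (task, remaining steps)
    started = {}   # task name -> step at which it was bound
    step = 0

    while queue or running:
        # Late binding: place waiting tasks while the head of the queue fits.
        while queue and queue[0].cores <= free:
            task = queue.popleft()
            free -= task.cores
            running.append((task, task.duration))
            started[task.name] = step

        # Advance one step; finished tasks release their cores.
        step += 1
        still_running = []
        for task, remaining in running:
            if remaining > 1:
                still_running.append((task, remaining - 1))
            else:
                free += task.cores
        running = still_running

    return started, step


if __name__ == "__main__":
    workload = [Task("a", 2, 3), Task("b", 2, 1),
                Task("c", 4, 2), Task("d", 1, 1)]
    started, makespan = simulate_pilot(4, workload)
    print(started, makespan)  # {'a': 0, 'b': 0, 'c': 3, 'd': 5} 6
```

In this sketch the whole workload runs inside one resource allocation, so only the pilot would wait in the batch queue. The strict FIFO policy is a simplification: task `d` waits behind `c` even while a core is idle, which a real pilot scheduler might avoid.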

Keywords

Pilot system, Placeholder job, Multilevel scheduling, HPC workflow

Notes

Acknowledgments

This work is supported by NSF “CAREER” ACI-1253644, NSF ACI-1440677 “RADICAL-Cybertools” and DOE Award DE-SC0016280. We acknowledge access to computational facilities: XSEDE resources (TG-MCB090174) and Blue Waters (NSF-1713749).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Andre Merzky (1)
  • Matteo Turilli (1)
  • Manuel Maldonado (1)
  • Mark Santcroos (1)
  • Shantenu Jha (1, 2)

  1. RADICAL Laboratory, Electrical and Computer Engineering, Rutgers University, Piscataway, USA
  2. Brookhaven National Laboratory, Upton, USA