Job Management with mpi_jm

  • Evan Berkowitz
  • Gustav Jansen
  • Kenneth McElvain
  • André Walker-Loud
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11203)


Access to Leadership computing is required both for HPC applications that need a large fraction of a machine's compute nodes for a single computation and for use cases where a large volume of smaller tasks can only be completed in a competitive or reasonable time frame at these Leadership computing facilities. In the latter case, a robust and lightweight manager is ideal, so that all these tasks can be executed in a machine-friendly way, notably with minimal use of mpirun or its equivalents to launch the executables (naive bundling of tasks can over-tax the service nodes and crash the entire scheduler). Our library, mpi_jm, can manage such allocations, provided the requisite MPI functionality is available. mpi_jm is fault-tolerant against a modest number of down or non-communicative nodes, can begin executing work on smaller portions of a larger allocation before all nodes become available, can manage GPU-intensive and CPU-only work independently, and can overlay them peacefully on shared nodes. It is easily incorporated into existing MPI-capable executables, which can then run both independently and under mpi_jm management. It provides a flexible Python interface, unlocking many high-level libraries, while also tightly binding users' executables to hardware.
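The abstract describes two behaviors worth making concrete: work can begin on the healthy portion of an allocation before every node is up, and down nodes are simply excluded rather than fatal. The sketch below is purely illustrative and does not reproduce the mpi_jm interface (which is not shown in this text); all names here (`Task`, `Allocation`, `schedule`) are hypothetical, and a real manager would place tasks on MPI subcommunicators rather than in a single-process loop.

```python
# Illustrative sketch only -- not the mpi_jm API. It shows greedy
# backfill of bundled tasks onto whichever nodes are currently healthy,
# with down nodes tolerated by exclusion.

from dataclasses import dataclass


@dataclass
class Task:
    name: str
    nodes: int  # number of nodes this task needs


@dataclass
class Allocation:
    healthy_nodes: int  # nodes that answered the wire-up handshake
    down_nodes: int = 0  # tolerated: excluded from scheduling, not fatal


def schedule(alloc, tasks):
    """Greedily pack tasks onto the healthy portion of the allocation.

    Large tasks are tried first; anything that does not fit yet waits
    for more nodes to come up (or for running tasks to finish).
    """
    free = alloc.healthy_nodes
    placed, waiting = [], []
    for t in sorted(tasks, key=lambda t: -t.nodes):
        if t.nodes <= free:
            free -= t.nodes
            placed.append(t.name)
        else:
            waiting.append(t.name)
    return placed, waiting


if __name__ == "__main__":
    # 8-node allocation, 2 nodes down: work starts on the 6 healthy ones.
    alloc = Allocation(healthy_nodes=6, down_nodes=2)
    tasks = [Task("A", 4), Task("B", 2), Task("C", 3)]
    print(schedule(alloc, tasks))  # (['A', 'B'], ['C'])
```

A real pilot system must additionally track task completion and node recovery; this loop only captures the placement decision.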


Keywords: Pilot systems · Job management · CORAL



We are indebted to the Livermore Computing Center for access to Sierra and for help getting set up there. In particular, our contacts John Gyllenhal and Adam Bertsch were very helpful and responsive, and other early users were very cooperative and collaborative, keeping us apprised of the state of the machine, the difficulties they encountered, and their workarounds. In particular, Jim Glosli, Tomas Oppelstrup, and especially Tom Scogland were very generous with their time and concern. At the Oak Ridge Leadership Computing Facility, Jack Wells provided excellent help, advice, and encouragement.

An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program to CalLat (2016), as well as by the Lawrence Livermore National Laboratory (LLNL) Multiprogrammatic and Institutional Computing program through a Tier 1 Grand Challenge award. This research used the NVIDIA GPU-accelerated Titan and Summit supercomputers at the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, and the NVIDIA GPU-accelerated Surface, Ray, and Sierra supercomputers at LLNL. This work was performed under the auspices of the U.S. Department of Energy by LLNL under Contract No. DE-AC52-07NA27344 and under Contract No. DE-AC02-05CH11231, under which the Regents of the University of California manage and operate Lawrence Berkeley National Laboratory and the National Energy Research Scientific Computing Center.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Institut für Kernphysik and Institute for Advanced Simulation, Forschungszentrum Jülich, Jülich, Germany
  2. National Center for Computational Sciences and Physics Division, Oak Ridge National Laboratory, Oak Ridge, USA
  3. University of California, Berkeley, Berkeley, USA
  4. Lawrence Berkeley National Laboratory, Berkeley, USA