Large-Scale Experiment for Topology-Aware Resource Management

  • Yiannis Georgiou
  • Guillaume Mercier
  • Adèle Villiermet
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10659)

Abstract

A Resource and Job Management System (RJMS) is a crucial part of the system software in the HPC stack. It is responsible for efficiently delivering computing power to applications in supercomputing environments, and its core intelligence lies in the resource selection techniques used to find the resources best suited to the users’ jobs. In [8], we introduced a new topology-aware resource selection algorithm that determines the best choice among the available nodes of the platform based on their position in the network and on application behaviour (expressed as a communication matrix). We integrated this algorithm as a plugin in Slurm and validated it with several optimization schemes by comparing it with the default Slurm algorithm. This paper presents further experiments on this selection process.
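
The core idea behind such a selection can be conveyed with a small, hypothetical sketch (this is not the Slurm plugin of [8], which builds on process placement techniques such as [10]; the function names, the one-process-per-node mapping and the flat two-switch topology below are assumptions made purely for illustration). Given a job's communication matrix and the switch each free node hangs off, the sketch picks the allocation that keeps the most traffic below a single switch:

    # Hypothetical illustration only: brute-force topology-aware selection.
    # The names (select_nodes, crossing_traffic), the one-process-per-node
    # mapping and the two-switch topology are assumptions, not the
    # algorithm of [8]; a real RJMS plugin needs a scalable heuristic.
    from itertools import combinations

    def crossing_traffic(allocation, comm, switch_of):
        """Communication volume exchanged across switch boundaries when
        process i of the job runs on node allocation[i]."""
        return sum(comm[i][j]
                   for i, j in combinations(range(len(allocation)), 2)
                   if switch_of[allocation[i]] != switch_of[allocation[j]])

    def select_nodes(free_nodes, k, comm, switch_of):
        """Pick the k free nodes minimising inter-switch traffic."""
        return min(combinations(free_nodes, k),
                   key=lambda cand: crossing_traffic(cand, comm, switch_of))

    # Toy platform: nodes 0-3 under switch "A", nodes 4-7 under switch "B".
    switch_of = {n: "A" if n < 4 else "B" for n in range(8)}
    # A 4-process job whose processes 0-2 communicate heavily together.
    comm = [[0, 5, 5, 1], [5, 0, 5, 1], [5, 5, 0, 1], [1, 1, 1, 0]]
    print(select_nodes(range(8), 4, comm, switch_of))  # -> (0, 1, 2, 3)

The exhaustive search over all node subsets is exponential and only serves to make the objective explicit; the contribution evaluated in this paper is precisely a selection scheme that remains tractable at scale.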

Keywords

Resource management · Job allocation · Topology-aware placement · Scheduling · Slurm

Acknowledgments

Experiments presented in this paper were carried out using the Grid’5000 testbed (see https://www.grid5000.fr). Part of this work is also supported by the ANR MOEBUS project ANR-13-INFR-0001 and by the ITEA3 COLOC project #13024.

References

  1. Balle, S.M., Palermo, D.J.: Enhancing an open source resource manager with multi-core/multi-threaded support. In: Frachtenberg, E., Schwiegelshohn, U. (eds.) JSSPP 2007. LNCS, vol. 4942, pp. 37–50. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78699-3_3
  2. Bosilca, G., Foyer, C., Jeannot, E., Mercier, G., Papaure, G.: Online dynamic monitoring of MPI communication. In: 23rd International European Conference on Parallel and Distributed Computing (Euro-Par 2017), Santiago de Compostela, August 2017, p. 12. Extended version: https://hal.inria.fr/hal-01485243
  3. Capit, N., Da Costa, G., Georgiou, Y., Huard, G., Martin, C., Mounié, G., Neyron, P., Richard, O.: A batch scheduler with high level components. In: Cluster Computing and Grid 2005 (CCGrid 2005). IEEE, Cardiff (2005). https://hal.archives-ouvertes.fr/hal-00005106
  4.
  5. Fujitsu: Interconnect Topology-Aware Resource Assignment. http://www.fujitsu.com/global/Images/technical-computing-suite-bp-sc12.pdf
  6. Georgiou, Y., Hautreux, M.: Evaluating scalability and efficiency of the resource and job management system on large HPC clusters. In: Cirne, W., Desai, N., Frachtenberg, E., Schwiegelshohn, U. (eds.) JSSPP 2012. LNCS, vol. 7698, pp. 134–156. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35867-8_8
  7. Georgiou, Y., Jeannot, E., Mercier, G., Villiermet, A.: Topology-aware job mapping. Int. J. High Perform. Comput. Appl. 32(1), 14–27 (2018)
  8. Georgiou, Y., Jeannot, E., Mercier, G., Villiermet, A.: Topology-aware resource management for HPC applications. In: Proceedings of the 18th International Conference on Distributed Computing and Networking, Hyderabad, India, 5–7 January 2017, p. 17. ACM (2017)
  9. Jeannot, E., Mercier, G.: Near-optimal placement of MPI processes on hierarchical NUMA architectures. In: D’Ambra, P., Guarracino, M., Talia, D. (eds.) Euro-Par 2010. LNCS, vol. 6272, pp. 199–210. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15291-7_20
  10. Jeannot, E., Mercier, G., Tessier, F.: Process placement in multicore clusters: algorithmic issues and practical techniques. IEEE Trans. Parallel Distrib. Syst. 25(4), 993–1002 (2014). https://doi.org/10.1109/TPDS.2013.104
  11.
  12.
  13. Smith, C., McMillan, B., Lumb, I.: Topology aware scheduling in the LSF distributed resource manager. In: Proceedings of the Cray User Group Meeting (2001)
  14. Yoo, A.B., Jette, M.A., Grondona, M.: SLURM: simple Linux utility for resource management. In: Feitelson, D., Rudolph, L., Schwiegelshohn, U. (eds.) JSSPP 2003. LNCS, vol. 2862, pp. 44–60. Springer, Heidelberg (2003). https://doi.org/10.1007/10968987_3
  15. Yu, H., Chung, I.H., Moreira, J.: Topology mapping for Blue Gene/L supercomputer. In: Supercomputing 2006. ACM, New York (2006). https://doi.org/10.1145/1188455.1188576

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Yiannis Georgiou (1)
  • Guillaume Mercier (2)
  • Adèle Villiermet (3)

  1. Atos–Bull, Grenoble, France
  2. Bordeaux INP, Talence, France
  3. Inria Bordeaux Sud-Ouest, Talence, France
