Effect of MPI tasks location on cluster throughput using NAS

Abstract

In this work, the Numerical Aerodynamic Simulation (NAS) benchmarks have been executed systematically on two clusters with rather different architectures and CPUs, in order to identify dependencies between the mapping of MPI tasks and the resulting speedup or resource occupation. To this end, series of experiments with the NAS kernels were designed to account for the complexity of the context in which scientific applications run on HPC environments (CPU-, I/O- or memory-bound behaviour, execution time, degree of parallelism, dedicated computational resources, and strong- and weak-scaling behaviour, among others). This context includes scheduling decisions, which strongly influence application performance and make it difficult to exploit HPC resources optimally with cost-effective strategies. An analysis is provided of how task-grouping strategies under various cluster setups drive job execution time and infrastructure throughput. As a result, criteria for cluster setup emerge, linked to maximizing the performance of individual jobs, maximizing total cluster throughput, or achieving better scheduling, and a criterion for execution decisions is suggested. This work is expected to be of interest for the design of scheduling policies and useful to HPC administrators.
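The task-grouping strategies compared in studies of this kind are typically variations on two canonical MPI mappings: compact (fill each node's cores before moving to the next) and scatter (distribute ranks round-robin across nodes). A minimal sketch of the two rank-to-node assignments, with illustrative function names not taken from the paper:

```python
def compact_mapping(n_ranks, cores_per_node):
    """Assign each MPI rank to a node, filling one node before the next."""
    return [rank // cores_per_node for rank in range(n_ranks)]

def scatter_mapping(n_ranks, n_nodes):
    """Assign each MPI rank to a node in round-robin (cyclic) order."""
    return [rank % n_nodes for rank in range(n_ranks)]

if __name__ == "__main__":
    # 8 MPI ranks on a cluster of 4 nodes, 4 cores per node.
    print(compact_mapping(8, 4))  # → [0, 0, 0, 0, 1, 1, 1, 1]
    print(scatter_mapping(8, 4))  # → [0, 1, 2, 3, 0, 1, 2, 3]
```

In Open MPI these strategies correspond roughly to `mpirun --map-by core` and `mpirun --map-by node`; compact placement favours intra-node communication, while scatter spreads memory-bandwidth pressure across nodes, which is the trade-off the experiments above probe.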

Acknowledgements

This work was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness through project CODEC2 (TIN2015-63562-R) with European Regional Development Fund (ERDF) support, and was carried out on computing facilities provided by the CYTED Network RICAP (517RT0529).

Author information

Corresponding author

Correspondence to José A. Moríñigo.

About this article

Cite this article

Rodríguez-Pascual, M., Moríñigo, J.A. & Mayo-García, R. Effect of MPI tasks location on cluster throughput using NAS. Cluster Comput 22, 1187–1198 (2019). https://doi.org/10.1007/s10586-018-02898-7

Keywords

  • MPI application performance
  • Benchmarking
  • Cluster throughput
  • NAS Parallel Benchmarks