In this work, the Numerical Aerodynamic Simulation (NAS) benchmarks have been executed systematically on two clusters with markedly different architectures and CPUs, in order to identify dependencies between the mapping of MPI tasks and the speedup or resource occupation. To this end, series of experiments with the NAS kernels have been designed to account for the complexity of the context in which scientific applications run on HPC environments (CPU-, I/O- or memory-bound behaviour, execution time, degree of parallelism, dedicated computational resources, and strong- and weak-scaling behaviour, to cite some). This context includes scheduling decisions, which have a great influence on application performance and make it difficult to exploit HPC resources optimally with cost-effective strategies. An analysis is provided of how task grouping strategies under various cluster setups drive the execution time of jobs and the infrastructure throughput. As a result, criteria for cluster setup emerge, aimed at maximizing the performance of individual jobs, maximizing total cluster throughput, or achieving better scheduling. In this respect, a criterion for execution decisions is suggested. This work is expected to be of interest for the design of scheduling policies and useful to HPC administrators.
This work was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness project CODEC2 (TIN2015-63562-R) with the European Regional Development Fund (ERDF), and was carried out on computing facilities provided by the CYTED Network RICAP (517RT0529).
Cite this article
Rodríguez-Pascual, M., Moríñigo, J.A. & Mayo-García, R. Effect of MPI tasks location on cluster throughput using NAS. Cluster Comput 22, 1187–1198 (2019). https://doi.org/10.1007/s10586-018-02898-7
- MPI application performance
- Cluster throughput
- NAS Parallel Benchmarks