Progress Thread Placement for Overlapping MPI Non-blocking Collectives Using Simultaneous Multi-threading

  • Alexandre Denis
  • Julien Jaeger
  • Hugo Taboada
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)


Non-blocking collectives have been proposed to allow communications to be overlapped with computation, so as to amortize the cost of MPI collective operations. To obtain a good overlap ratio, communications and computation have to run in parallel. Different hardware and software techniques exist to achieve this; dedicating some cores to run progress threads is one of them. However, some CPUs provide Simultaneous Multi-Threading, the ability of a core to run multiple hardware threads simultaneously while sharing the same arithmetic units. Our idea is to run progress threads on these hardware threads to avoid dedicating whole cores to them. We have run benchmarks on Haswell processors, using their Hyper-Threading capability, and obtain good results for both performance and overlap only when MPI processes use inter-node communications. However, we also show that enabling Simultaneous Multi-Threading for intra-node communications leads to poor performance due to cache effects.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Inria, LaBRI, Univ. Bordeaux, CNRS, Bordeaux-INP, Talence, France
  2. CEA, DAM, DIF, Arpajon, France
