
Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters

  • Conference paper
  • In: High Performance Computing (ISC High Performance 2015)

Abstract

Several techniques have been proposed in the past for designing non-blocking collective operations on high-performance clusters. While some of them required a dedicated process/thread or periodic probing to progress the collective, others needed specialized hardware solutions. The former technique, while applicable to any generic HPC cluster, had the drawback of stealing CPU cycles away from the compute task. The latter gave near perfect overlap but increased the total cost of the HPC installation due to the need for specialized hardware, and had other drawbacks that limited its applicability. On the other hand, Remote Direct Memory Access (RDMA) technology and high-performance networks have been pushing the envelope of HPC performance to multi-petaflop levels. However, no scholarly work exists that explores the impact such RDMA technology can bring to the design of non-blocking collective primitives. In this paper, we take up this challenge and propose efficient designs of personalized non-blocking collective operations on top of the basic RDMA primitives. Our experimental evaluation shows that the proposed designs are able to deliver near perfect overlap of computation and communication for personalized collective operations on modern HPC systems at scale. At the microbenchmark level, the proposed RDMA-aware collectives deliver improvements in latency of up to 89 times for MPI_Igatherv, 3.71 times for MPI_Ialltoall, and 3.23 times for MPI_Iscatter over state-of-the-art designs. We also observe an improvement of up to 19% for the P3DFFT kernel at 8,192 cores on the Stampede supercomputing system at TACC.
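The overlap the abstract refers to follows the standard MPI-3 non-blocking collective interface: the application initiates a personalized collective, performs independent computation, and only later waits for completion. The sketch below is a minimal, hedged illustration of that usage pattern with MPI_Ialltoall; it shows only the MPI-3 API, not the paper's RDMA-aware implementation, and the message counts and compute step are placeholders.

/* Minimal sketch of computation/communication overlap with an MPI-3
 * non-blocking personalized collective (MPI_Ialltoall). Illustrative
 * only: counts and the compute step are placeholders, not values
 * taken from the paper. */
#include <mpi.h>
#include <stdlib.h>

static void independent_compute(void) {
    /* Placeholder for application work that does not depend on the
     * data being exchanged. */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;  /* elements sent to each peer (assumed) */
    int *sendbuf = malloc((size_t)size * count * sizeof(int));
    int *recvbuf = malloc((size_t)size * count * sizeof(int));
    for (int i = 0; i < size * count; i++) sendbuf[i] = i;

    MPI_Request req;
    /* Initiate the personalized exchange and return immediately. */
    MPI_Ialltoall(sendbuf, count, MPI_INT,
                  recvbuf, count, MPI_INT,
                  MPI_COMM_WORLD, &req);

    /* Overlap window: do useful work while the collective progresses.
     * How much of this window is actually overlapped depends on how the
     * MPI library progresses the operation, which is what the paper's
     * RDMA-aware designs aim to improve. */
    independent_compute();

    /* Complete the collective before reading recvbuf. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}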

This research is supported in part by National Science Foundation grants #CCF-1213084, #CNS-1419123, and #IIS-1447804.



Author information

Correspondence to Hari Subramoni.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Subramoni, H. et al. (2015). Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science, vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_31


  • DOI: https://doi.org/10.1007/978-3-319-20119-1_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20118-4

  • Online ISBN: 978-3-319-20119-1

  • eBook Packages: Computer Science (R0)
