Compiler Optimizations for Non-contiguous Remote Data Movement

  • Conference paper
  • In: Languages and Compilers for Parallel Computing (LCPC 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8664)

Abstract

Remote Memory Access (RMA) programming is one of the core concepts behind modern parallel programming languages such as UPC and Fortran 2008, as well as high-performance libraries such as MPI-3 One Sided or SHMEM. Many applications have to communicate non-contiguous data because of their data layout in main memory. Previous studies showed that such non-contiguous transfers can reduce communication performance by up to an order of magnitude. In this work, we demonstrate a simple scheme for statically optimizing non-contiguous RMA transfers by combining partial packing, communication overlap, and remote access pipelining. We determine accurate performance models for the various operations in order to find near-optimal pipeline parameters. The proposed approach is applicable to all RMA languages and does not depend on the availability of special hardware features such as scatter-gather lists or strided copies. We show that the proposed superpipelining leads to significant improvements over either full packing or sending each contiguous segment individually, and we outline how the approach can be used to optimize non-contiguous data transfers in PGAS programs automatically. For a realistic application, we observed a 37% performance gain over the faster of packing and individual sending.
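The superpipelining idea in the abstract (pack the next block of segments while the previous block is in flight, and choose the block size from a cost model) can be illustrated with a toy analytical model. The parameter names below (L for per-message latency, G for per-byte network cost, g_pack for per-byte packing cost) are illustrative assumptions in the spirit of LogGP-style models, not the paper's actual performance model:

```python
def t_individual(n, s, L, G):
    """One RMA put per contiguous segment: n messages, no packing."""
    return n * (L + s * G)

def t_full_pack(n, s, L, G, g_pack):
    """Pack all n segments into one buffer, then one large transfer.
    Packing and communication do not overlap."""
    return n * s * g_pack + L + n * s * G

def t_pipelined(n, s, L, G, g_pack, b):
    """Superpipelining: pack blocks of b segments while the previous
    block is being transferred. In steady state each block costs only
    the slower of the packing and sending stages. For simplicity the
    model charges a full block even if the last block is partial."""
    blocks = -(-n // b)  # ceiling division
    t_pack = b * s * g_pack
    t_send = L + b * s * G
    return t_pack + (blocks - 1) * max(t_pack, t_send) + t_send

def best_block(n, s, L, G, g_pack):
    """Exhaustively pick the block size minimizing the modeled time."""
    return min(range(1, n + 1),
               key=lambda b: t_pipelined(n, s, L, G, g_pack, b))
```

Because packing overlaps with transfer, an intermediate block size amortizes the per-message latency without serializing the full packing cost, which is why the pipelined variant can beat both extremes: b = n degenerates to full packing, and b = 1 approaches per-segment sending plus packing overhead.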



Acknowledgments

We thank the Swiss National Supercomputing Center (CSCS) and the Blue Waters project at NCSA/UIUC for access to the test systems. We also thank the anonymous reviewers for comments that greatly improved our work.

Correspondence to Torsten Hoefler.


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Schneider, T., Gerstenberger, R., Hoefler, T. (2014). Compiler Optimizations for Non-contiguous Remote Data Movement. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science(), vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_18

  • DOI: https://doi.org/10.1007/978-3-319-09967-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09966-8

  • Online ISBN: 978-3-319-09967-5
