Compiler Optimizations for Non-contiguous Remote Data Movement

  • Conference paper
  • In: Languages and Compilers for Parallel Computing (LCPC 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8664)

Abstract

Remote Memory Access (RMA) programming is one of the core concepts behind modern parallel programming languages such as UPC and Fortran 2008, as well as high-performance libraries such as MPI-3 One Sided or SHMEM. Many applications have to communicate non-contiguous data because of their data layout in main memory. Previous studies showed that such non-contiguous transfers can reduce communication performance by up to an order of magnitude. In this work, we demonstrate a simple scheme for statically optimizing non-contiguous RMA transfers by combining partial packing, communication overlap, and remote access pipelining. We determine accurate performance models for the various operations in order to find near-optimal pipeline parameters. The proposed approach is applicable to all RMA languages and does not depend on the availability of special hardware features such as scatter-gather lists or strided copies. We show that the proposed superpipelining leads to significant improvements over either full packing or sending each contiguous segment individually, and we outline how the approach can be used to optimize non-contiguous data transfers in PGAS programs automatically. For a realistic application, we observed a 37% performance gain over the faster of packing and individual sending.
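The superpipelining idea in the abstract (pack the next block of segments while the previous block is in flight, and choose the block size from a cost model) can be illustrated with a toy analytical model. The parameter names below (L for per-message latency, G for per-byte network cost, g_pack for per-byte packing cost) are illustrative assumptions in the spirit of LogGP-style models, not the paper's actual performance model:

```python
def t_individual(n, s, L, G):
    """One RMA put per contiguous segment: n messages, no packing."""
    return n * (L + s * G)

def t_full_pack(n, s, L, G, g_pack):
    """Pack all n segments into one buffer, then one large transfer.
    Packing and communication do not overlap."""
    return n * s * g_pack + L + n * s * G

def t_pipelined(n, s, L, G, g_pack, b):
    """Superpipelining: pack blocks of b segments while the previous
    block is being transferred. In steady state each block costs only
    the slower of the packing and sending stages. For simplicity the
    model charges a full block even if the last block is partial."""
    blocks = -(-n // b)  # ceiling division
    t_pack = b * s * g_pack
    t_send = L + b * s * G
    return t_pack + (blocks - 1) * max(t_pack, t_send) + t_send

def best_block(n, s, L, G, g_pack):
    """Exhaustively pick the block size minimizing the modeled time."""
    return min(range(1, n + 1),
               key=lambda b: t_pipelined(n, s, L, G, g_pack, b))
```

Because packing overlaps with transfer, an intermediate block size amortizes the per-message latency without serializing the full packing cost, which is why the pipelined variant can beat both extremes: b = n degenerates to full packing, and b = 1 approaches per-segment sending plus packing overhead.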



Acknowledgments

We thank the Swiss National Supercomputing Center (CSCS) and the Blue Waters project at NCSA/UIUC for access to the test systems. We also thank the anonymous reviewers for comments that greatly improved our work.

Correspondence to Torsten Hoefler.


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Schneider, T., Gerstenberger, R., Hoefler, T. (2014). Compiler Optimizations for Non-contiguous Remote Data Movement. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science(), vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_18

  • DOI: https://doi.org/10.1007/978-3-319-09967-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09966-8

  • Online ISBN: 978-3-319-09967-5
