Exploiting Non-blocking Remote Memory Access Communication in Scientific Benchmarks

  • Vinod Tipparaju
  • Manojkumar Krishnan
  • Jarek Nieplocha
  • Gopalakrishnan Santhanaraman
  • Dhabaleswar Panda
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2913)


This paper presents a comparative performance study of the MPI and Remote Memory Access (RMA) communication models in the context of four scientific benchmarks: NAS MG, NAS CG, SUMMA matrix multiplication, and Lennard-Jones molecular dynamics, on clusters with the Myrinet network. RMA communication is shown to deliver a consistent performance advantage over MPI; in some cases an improvement of as much as 50% was achieved. The benefits of using non-blocking RMA to overlap computation with communication are also discussed.







Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Vinod Tipparaju¹
  • Manojkumar Krishnan¹
  • Jarek Nieplocha¹
  • Gopalakrishnan Santhanaraman²
  • Dhabaleswar Panda²

  1. Pacific Northwest National Laboratory, Richland, USA
  2. Ohio State University, Columbus, USA
