
Communication-minimal partitioning of parallel loops and data arrays for cache-coherent distributed-memory multiprocessors


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1239)

Abstract

Harnessing the full performance potential of cache-coherent distributed shared memory multiprocessors without inordinate user effort requires a compilation technology that can automatically manage multiple levels of memory hierarchy. This paper describes a working compiler for such machines that automatically partitions loops and data arrays to optimize locality of access.

The compiler implements a solution to the problem of finding loop and data partitions with minimal communication. Our algorithm handles programs containing multiple nested parallel loops that access many arrays, where the array access indices are general affine functions of the loop variables. It discovers communication-minimal partitions when communication-free partitions do not exist, and it uses sub-blocking to handle finite cache sizes.
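
To make the setting concrete, here is a small illustrative sketch (not output of the paper's compiler): a doubly nested parallel loop whose accesses A[i][j] and B[j][i] are affine in the loop variables i and j. If the iterations of the i-loop and the rows of A are block-partitioned identically across processors, every access to A is local; accesses to B[j][i] then cross partition boundaries whenever B is also distributed by rows, so the partitioner must trade off loop and data partitions to minimize the remaining communication. The function name, NPROC, and the simple block distribution below are assumptions made for illustration only.

```c
#define N     1024
#define NPROC 16

double A[N][N], B[N][N];

/* Hypothetical per-processor body: processor p owns the block of
 * iterations [lo, hi) of the parallel i-loop and, by the matching
 * data partition, rows [lo, hi) of A. */
void partitioned_loop(int p)
{
    int lo = p * (N / NPROC);
    int hi = lo + (N / NPROC);

    for (int i = lo; i < hi; i++)       /* block-partitioned parallel loop */
        for (int j = 0; j < N; j++)
            A[i][j] += B[j][i];         /* A[i][j] local; B[j][i] may be remote */
}
```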

A cost model is presented that estimates the cost of a given loop and data partition from machine parameters such as cache, local, and remote access times. Minimizing the cost estimated by this model is an NP-complete problem, as is the fully general partitioning problem. A heuristic method that provides good approximate solutions in polynomial time is therefore presented.
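
The sketch below shows the general shape of such a cost estimate, assuming a candidate partition has already been characterized by how many of its accesses hit in cache, miss to local memory, or miss to a remote node. The structure and field names, the latency values, and the two hard-coded candidates are hypothetical; a real heuristic would enumerate and compare many candidate partitions rather than two.

```c
#include <stdio.h>

typedef struct {
    double t_cache;    /* cached access time (cycles)   */
    double t_local;    /* local memory access (cycles)  */
    double t_remote;   /* remote memory access (cycles) */
} MachineParams;

typedef struct {
    long cache_hits;
    long local_misses;
    long remote_misses;  /* communication induced by the partition */
} AccessCounts;

/* Estimated execution cost of one candidate loop/data partition. */
double partition_cost(const MachineParams *m, const AccessCounts *c)
{
    return c->cache_hits    * m->t_cache
         + c->local_misses  * m->t_local
         + c->remote_misses * m->t_remote;
}

int main(void)
{
    MachineParams machine = { 1.0, 11.0, 38.0 };   /* illustrative latencies */

    /* Two hypothetical candidate partitions of the same loop nest. */
    AccessCounts by_rows    = { 900000, 80000, 20000 };
    AccessCounts by_columns = { 900000, 40000, 60000 };

    /* A heuristic search would keep the cheaper candidate at each step. */
    printf("rows: %.0f cycles, columns: %.0f cycles\n",
           partition_cost(&machine, &by_rows),
           partition_cost(&machine, &by_columns));
    return 0;
}
```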

The loop and data partitioning algorithm has been implemented in the compiler for the MIT Alewife machine. Results are presented showing that combined optimization of loops and data can improve runtime by nearly a factor of two over optimization of loops alone.

This research was funded in part by ARPA contract #N00014-94-1-0985 and in part by NSF grant #MIP-9504399.






Editor information

David Sehr, Utpal Banerjee, David Gelernter, Alex Nicolau, David Padua


Copyright information

© 1997 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Barua, R., Kranz, D., Agarwal, A. (1997). Communication-minimal partitioning of parallel loops and data arrays for cache-coherent distributed-memory multiprocessors. In: Sehr, D., Banerjee, U., Gelernter, D., Nicolau, A., Padua, D. (eds) Languages and Compilers for Parallel Computing. LCPC 1996. Lecture Notes in Computer Science, vol 1239. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0017263

  • DOI: https://doi.org/10.1007/BFb0017263

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63091-3

  • Online ISBN: 978-3-540-69128-0
