Abstract
Harnessing the full performance potential of cache-coherent distributed shared memory multiprocessors without inordinate user effort requires a compilation technology that can automatically manage multiple levels of memory hierarchy. This paper describes a working compiler for such machines that automatically partitions loops and data arrays to optimize locality of access.
The compiler implements a solution to the problem of finding partitions of loops and data with minimal communication. Our algorithm handles programs with multiple nested parallel loops accessing many arrays, in which the array access indices are general affine functions of the loop variables. It discovers communication-minimal partitions when communication-free partitions do not exist. The compiler also uses sub-blocking to handle finite cache sizes.
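To make the setting concrete, the following sketch (not the paper's algorithm; the blocking scheme, owner function, and example accesses are invented for illustration) models an array access as an affine map i ↦ A·i + b and checks whether a given blocking of the iteration space is communication-free, i.e. every block of iterations touches data owned by a single processor:

```python
from itertools import product

def affine(A, b, i):
    # Evaluate the affine access function A*i + b for iteration vector i.
    return tuple(sum(a * x for a, x in zip(row, i)) + c
                 for row, c in zip(A, b))

def communication_free(accesses, iter_blocks, owner):
    # True if, under every access, each iteration block touches data
    # belonging to exactly one owner; then loops and data can be
    # partitioned with no communication.
    for block in iter_blocks:
        ranges = [range(lo, hi) for lo, hi in block]
        owners = {owner(affine(A, b, i))
                  for A, b in accesses
                  for i in product(*ranges)}
        if len(owners) > 1:
            return False
    return True

# Hypothetical example: a 2-D loop reading X[i][j] and X[i][j+1] on an
# 8x8 iteration space. Blocking rows keeps both accesses local to one
# owner; blocking columns crosses block boundaries at j+1.
ident = [[1, 0], [0, 1]]
accesses = [(ident, (0, 0)), (ident, (0, 1))]
row_blocks = [[(r, r + 4), (0, 8)] for r in range(0, 8, 4)]
col_blocks = [[(0, 8), (c, c + 4)] for c in range(0, 8, 4)]
print(communication_free(accesses, row_blocks, lambda d: d[0] // 4))  # True
print(communication_free(accesses, col_blocks, lambda d: d[1] // 4))  # False
```

The example mirrors the paper's premise: the loop partition and the data partition must be chosen together, since a partition that is communication-free for one access pattern incurs boundary traffic for another.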
A cost model is presented that estimates the cost of a loop and data partitioning given machine parameters such as cache, local, and remote access timings. Minimizing the cost as estimated by our model is an NP-complete problem, as is the fully general problem of partitioning. A heuristic method that provides good approximate solutions in polynomial time is presented.
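The following is a hedged illustration of this style of cost-driven selection; the timing constants, candidate names, and access counts are invented, and exhaustive enumeration stands in for the paper's polynomial-time heuristic, which the NP-completeness of the general problem makes necessary on real inputs:

```python
# Hypothetical machine timings (cycles per access): cache hit,
# local memory, remote memory. Not the paper's measured parameters.
CACHE, LOCAL, REMOTE = 0.2, 1, 10

def partition_cost(cache_hits, local_hits, remote_hits):
    # Estimated runtime contribution of memory accesses under one
    # candidate (loop partition, data partition) pair.
    return cache_hits * CACHE + local_hits * LOCAL + remote_hits * REMOTE

def best_partition(candidates):
    # candidates: {name: (cache, local, remote) access counts}.
    # A real compiler would search this space heuristically rather
    # than enumerate it.
    return min(candidates, key=lambda k: partition_cost(*candidates[k]))

candidates = {
    "block-rows": (500, 9000, 100),   # mostly local accesses
    "block-cols": (500, 5000, 4500),  # heavy remote traffic
}
print(best_partition(candidates))  # block-rows
```

The point of the model is to rank candidates by predicted memory cost so the compiler can trade a slightly worse loop partition for a much better data partition, or vice versa.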
The loop and data partitioning algorithm has been implemented in the compiler for the MIT Alewife machine. Results are presented that show combined optimization of loops and data can improve runtime by nearly a factor of two over optimization of loops alone.
This research was funded in part by ARPA contract #N00014-94-1-0985 and in part by NSF grant #MIP-9504399.
© 1997 Springer-Verlag Berlin Heidelberg
Barua, R., Kranz, D., Agarwal, A. (1997). Communication-minimal partitioning of parallel loops and data arrays for cache-coherent distributed-memory multiprocessors. In: Sehr, D., Banerjee, U., Gelernter, D., Nicolau, A., Padua, D. (eds) Languages and Compilers for Parallel Computing. LCPC 1996. Lecture Notes in Computer Science, vol 1239. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0017263
Print ISBN: 978-3-540-63091-3
Online ISBN: 978-3-540-69128-0