Abstract
Harnessing the full performance potential of cache-coherent distributed shared memory multiprocessors without inordinate user effort requires a compilation technology that can automatically manage multiple levels of memory hierarchy. This paper describes a working compiler for such machines that automatically partitions loops and data arrays to optimize locality of access.
The compiler implements a solution to the problem of finding partitions of loops and data with minimal communication. Our algorithm handles programs with multiple nested parallel loops accessing many arrays, in which the array access indices are general affine functions of the loop variables. It discovers communication-minimal partitions when communication-free partitions do not exist. The compiler also uses sub-blocking to handle finite cache sizes.
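To make the setting concrete, the following sketch (not the paper's algorithm; the blocking scheme, owner function, and example accesses are invented for illustration) models an array access as an affine map i ↦ A·i + b and checks whether a given blocking of the iteration space is communication-free, i.e. every block of iterations touches data owned by a single processor:

```python
from itertools import product

def affine(A, b, i):
    # Evaluate the affine access function A*i + b for iteration vector i.
    return tuple(sum(a * x for a, x in zip(row, i)) + c
                 for row, c in zip(A, b))

def communication_free(accesses, iter_blocks, owner):
    # True if, under every access, each iteration block touches data
    # belonging to exactly one owner; then loops and data can be
    # partitioned with no communication.
    for block in iter_blocks:
        ranges = [range(lo, hi) for lo, hi in block]
        owners = {owner(affine(A, b, i))
                  for A, b in accesses
                  for i in product(*ranges)}
        if len(owners) > 1:
            return False
    return True

# Hypothetical example: a 2-D loop reading X[i][j] and X[i][j+1] on an
# 8x8 iteration space. Blocking rows keeps both accesses local to one
# owner; blocking columns crosses block boundaries at j+1.
ident = [[1, 0], [0, 1]]
accesses = [(ident, (0, 0)), (ident, (0, 1))]
row_blocks = [[(r, r + 4), (0, 8)] for r in range(0, 8, 4)]
col_blocks = [[(0, 8), (c, c + 4)] for c in range(0, 8, 4)]
print(communication_free(accesses, row_blocks, lambda d: d[0] // 4))  # True
print(communication_free(accesses, col_blocks, lambda d: d[1] // 4))  # False
```

The example mirrors the paper's premise: the loop partition and the data partition must be chosen together, since a partition that is communication-free for one access pattern incurs boundary traffic for another.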
A cost model is presented that estimates the cost of a loop and data partitioning given machine parameters such as cache, local, and remote access timings. Minimizing the cost as estimated by our model is an NP-complete problem, as is the fully general problem of partitioning. A heuristic method that provides good approximate solutions in polynomial time is presented.
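The following is a hedged illustration of this style of cost-driven selection; the timing constants, candidate names, and access counts are invented, and exhaustive enumeration stands in for the paper's polynomial-time heuristic, which the NP-completeness of the general problem makes necessary on real inputs:

```python
# Hypothetical machine timings (cycles per access): cache hit,
# local memory, remote memory. Not the paper's measured parameters.
CACHE, LOCAL, REMOTE = 0.2, 1, 10

def partition_cost(cache_hits, local_hits, remote_hits):
    # Estimated runtime contribution of memory accesses under one
    # candidate (loop partition, data partition) pair.
    return cache_hits * CACHE + local_hits * LOCAL + remote_hits * REMOTE

def best_partition(candidates):
    # candidates: {name: (cache, local, remote) access counts}.
    # A real compiler would search this space heuristically rather
    # than enumerate it.
    return min(candidates, key=lambda k: partition_cost(*candidates[k]))

candidates = {
    "block-rows": (500, 9000, 100),   # mostly local accesses
    "block-cols": (500, 5000, 4500),  # heavy remote traffic
}
print(best_partition(candidates))  # block-rows
```

The point of the model is to rank candidates by predicted memory cost so the compiler can trade a slightly worse loop partition for a much better data partition, or vice versa.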
The loop and data partitioning algorithm has been implemented in the compiler for the MIT Alewife machine. Results are presented that show combined optimization of loops and data can improve runtime by nearly a factor of two over optimization of loops alone.
This research was funded in part by ARPA contract #N00014-94-1-0985 and in part by NSF grant #MIP-9504399.
© 1997 Springer-Verlag Berlin Heidelberg
Barua, R., Kranz, D., Agarwal, A. (1997). Communication-minimal partitioning of parallel loops and data arrays for cache-coherent distributed-memory multiprocessors. In: Sehr, D., Banerjee, U., Gelernter, D., Nicolau, A., Padua, D. (eds) Languages and Compilers for Parallel Computing. LCPC 1996. Lecture Notes in Computer Science, vol 1239. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0017263
Print ISBN: 978-3-540-63091-3
Online ISBN: 978-3-540-69128-0