Summary
This paper presents a theoretical framework for automatically partitioning parallel loops and data arrays on cache-coherent NUMA multiprocessors so as to minimize both cache coherency traffic and remote memory references. Although several previous papers have examined hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. This paper solves that problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. The framework uses matrices to represent iteration and data space mappings, and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors, and use linear algebraic methods and lattice theory to compute the size of data footprints precisely. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. We also present a heuristic for combined partitioning of loops and data arrays that aims to maximize both the probability that references hit in the cache and the probability that cache misses are satisfied by local memory. We have implemented this framework in a compiler for Alewife, a distributed shared memory multiprocessor.
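To make the idea of a data footprint concrete, the following is a minimal illustrative sketch and not the chapter's method. It counts footprints by brute-force enumeration rather than by the linear algebra and lattice theory the summary describes, and it searches only rectangular two-dimensional tiles rather than general hyperparallelepipeds. The function names, the row-vector form of the affine reference (iteration vector times a matrix `G`, plus an offset), and the example references are all assumptions made for illustration.

```python
from itertools import product


def footprint_size(tile_shape, G, offsets):
    """Count the distinct array elements touched by one rectangular tile.

    Assumed model: every reference maps an iteration vector i to i*G + a,
    where all references share the matrix G and differ only in the offset a.
    """
    touched = set()
    for it in product(*(range(n) for n in tile_shape)):
        for a in offsets:
            elem = tuple(
                sum(it[k] * G[k][d] for k in range(len(it))) + a[d]
                for d in range(len(a))
            )
            touched.add(elem)
    return len(touched)


def best_rectangular_tile(volume, G, offsets):
    """Exhaustively pick the 2-D rectangular tile shape (same iteration
    count for every candidate) with the smallest footprint."""
    candidates = [(h, volume // h) for h in range(1, volume + 1) if volume % h == 0]
    return min(candidates, key=lambda shape: footprint_size(shape, G, offsets))


# Hypothetical loop body referencing A[i,j], A[i+1,j] and A[i,j+1].
G = [[1, 0], [0, 1]]
stencil_offsets = [(0, 0), (1, 0), (0, 1)]

print(footprint_size((4, 4), G, stencil_offsets))        # 24
print(best_rectangular_tile(16, G, stencil_offsets))     # (4, 4)

# Hypothetical loop body referencing A[i,j] and A[i,j+2].
print(best_rectangular_tile(16, G, [(0, 0), (0, 2)]))    # (1, 16)
```

In this toy model every candidate tile performs the same number of iterations, so a smaller footprint means fewer distinct elements fetched per tile. The symmetric stencil favours a square tile, while the reference pair offset along one dimension favours a tile stretched along that dimension.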
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Agarwal, A., Kranz, D., Barua, R., Natarajan, V. (2001). Optimal Tiling for Minimizing Communication in Distributed Shared-Memory Multiprocessors. In: Pande, S., Agrawal, D.P. (eds) Compiler Optimizations for Scalable Parallel Systems. Lecture Notes in Computer Science, vol 1808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45403-9_9
DOI: https://doi.org/10.1007/3-540-45403-9_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41945-7
Online ISBN: 978-3-540-45403-8
eBook Packages: Springer Book Archive