Optimal Tiling for Minimizing Communication in Distributed Shared-Memory Multiprocessors

Agarwal, Anant; Kranz, David; Barua, Rajeev; Natarajan, Venkat

doi:10.1007/3-540-45403-9_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1808)

Summary

This paper presents a theoretical framework for automatically partitioning parallel loops and data arrays for cache-coherent NUMA multiprocessors to minimize both cache coherency traffic and remote memory references. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. This paper solves that open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. The framework uses matrices to represent iteration and data space mappings, and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors, and we use linear algebraic methods and lattice theory to compute the size of data footprints precisely. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. We also present a heuristic for combined partitioning of loops and data arrays that aims to maximize the probability that references hit in the cache and the probability that cache misses are satisfied by the local memory. We have implemented this framework in a compiler for Alewife, a distributed shared memory multiprocessor.
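
To make the notion of a data footprint more concrete, the following minimal sketch counts, by brute-force enumeration, the distinct array elements touched by a set of references over a rectangular tile. It assumes references of the form A[i·G + a] that share the matrix G and differ only in the constant offset a, which is one plausible reading of "uniformly intersecting"; the function name `footprint_size`, the example loop body, and the tile sizes are illustrative and do not come from the chapter, whose method computes footprint sizes analytically rather than by enumeration.

```python
from itertools import product

def footprint_size(G, offsets, tile):
    """Count distinct data elements touched by references A[i * G + a]
    for every iteration i in a rectangular tile and every offset a.
    (Illustrative helper, not the chapter's analytical method.)"""
    touched = set()
    for i in product(*(range(n) for n in tile)):
        # Row-vector convention: data index d = i * G, one entry per column of G.
        base = tuple(sum(i[k] * G[k][c] for k in range(len(i)))
                     for c in range(len(G[0])))
        for a in offsets:
            touched.add(tuple(b + x for b, x in zip(base, a)))
    return len(touched)

# Hypothetical loop body: B[i, j] = A[i, j] + A[i+1, j] + A[i, j+1]
G = [[1, 0], [0, 1]]                # identity mapping from iterations to data
offsets = [(0, 0), (1, 0), (0, 1)]  # the three references to A

square = footprint_size(G, offsets, (4, 4))   # 4 x 4 tile, 16 iterations
strip = footprint_size(G, offsets, (16, 1))   # 16 x 1 tile, 16 iterations
print(square, strip)  # 24 33
```

Both tiles contain 16 iterations, yet in this example the square tile touches fewer distinct elements of A than the strip. Differences of this kind are what a choice of tile shape has to weigh.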

Author information

Authors and Affiliations

Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139
Anant Agarwal & David Kranz
Department of Electrical & Computer Engineering, University of Maryland, College Park, MD, 20742
Rajeev Barua
Wireless Systems Center, Semiconductor Products Sector Motorola, USA
Venkat Natarajan

Editor information

Editors and Affiliations

College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA, 30332, USA
Santosh Pande
Department of ECECS, University of Cincinnati, P.O. Box 210030, Cincinnati, OH, 45221-0030, USA
Dharma P. Agrawal

About this chapter

Cite this chapter

Agarwal, A., Kranz, D., Barua, R., Natarajan, V. (2001). Optimal Tiling for Minimizing Communication in Distributed Shared-Memory Multiprocessors. In: Pande, S., Agrawal, D.P. (eds) Compiler Optimizations for Scalable Parallel Systems. Lecture Notes in Computer Science, vol 1808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45403-9_9

DOI: https://doi.org/10.1007/3-540-45403-9_9
Published: 18 May 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41945-7
Online ISBN: 978-3-540-45403-8
eBook Packages: Springer Book Archive
