Optimal Tiling for Minimizing Communication in Distributed Shared-Memory Multiprocessors

Chapter in: Compiler Optimizations for Scalable Parallel Systems

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1808))

Summary

This paper presents a theoretical framework for automatically partitioning parallel loops and data arrays on cache-coherent NUMA multiprocessors to minimize both cache-coherency traffic and remote memory references. While several previous papers have examined hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. This paper solves that open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. The framework uses matrices to represent iteration- and data-space mappings, and the notion of uniformly intersecting references to capture temporal locality among array references. We introduce the notion of data footprints to estimate the communication traffic between processors, and we use linear-algebraic methods and lattice theory to compute the size of data footprints precisely. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed-memory multicomputers. We also present a heuristic for the combined partitioning of loops and data arrays that maximizes the probability that references hit in the cache and that cache misses are satisfied by local memory. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.
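To make the footprint idea concrete, the following is a minimal sketch, not taken from the paper, of how the size of a data footprint can be estimated for a single affine array reference under a parallelogram (2-D hyperparallelepiped) loop tile. The names `T`, `A`, and `footprint_size` are illustrative; the key assumption, consistent with the abstract's use of lattice theory, is that the number of lattice points covered by the image of a tile is well approximated by its volume, i.e. the absolute determinant of the image's edge-vector matrix.

```python
# Sketch: estimating a data footprint's size for one affine reference
# a[A . i + b] under a loop tile whose edge vectors form the columns of T.
# The offset b translates the footprint but does not change its size.

def det2(m):
    """Determinant of a 2x2 matrix given as [[a, b], [c, d]]."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Tile matrix T: columns are the edge vectors of the iteration-space tile.
T = [[16, 0],
     [0, 16]]      # a 16x16 rectangular tile, 256 iterations

# Access matrix A of an affine reference, e.g. a[i][i + j].
A = [[1, 0],
     [1, 1]]

# The footprint is (approximately) the image of the tile under A, so its
# lattice-point count is approximated by |det(A * T)|, the volume of the
# image parallelepiped.
footprint_size = abs(det2(matmul2(A, T)))
print(footprint_size)   # 256: here det(A) = 1, so the map is volume-preserving
```

Under this approximation, comparing |det(A * T)| across candidate tile shapes `T` is what lets a compiler rank tilings by the communication they induce; the paper's contribution is deriving the optimal `T` exactly rather than by enumeration.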


Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Agarwal, A., Kranz, D., Barua, R., Natarajan, V. (2001). Optimal Tiling for Minimizing Communication in Distributed Shared-Memory Multiprocessors. In: Pande, S., Agrawal, D.P. (eds) Compiler Optimizations for Scalable Parallel Systems. Lecture Notes in Computer Science, vol 1808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45403-9_9

  • DOI: https://doi.org/10.1007/3-540-45403-9_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41945-7

  • Online ISBN: 978-3-540-45403-8

  • eBook Packages: Springer Book Archive
