Practical Loop Transformations for Tensor Contraction Expressions on Multi-level Memory Hierarchies

  • Wenjing Ma
  • Sriram Krishnamoorthy
  • Gagan Agrawal
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6601)


Modern architectures are characterized by deeper levels of memory hierarchy, often explicitly addressable. Optimizing applications for such architectures requires careful management of the data movement across all these levels. In this paper, we focus on the problem of mapping tensor contractions to memory hierarchies with more than two levels, specifically addressing placement of memory allocation and data movement statements, choice of loop fusions, and tile size selection. Existing algorithms to find an integrated solution to this problem even for two-level memory hierarchies have been shown to be expensive. We improve upon this work by focusing on the first-order cost components, simplifying the analysis required and reducing the number of candidates to be evaluated. We have evaluated our framework on a cluster of GPUs. Using five candidate tensor contraction expressions, we show that fusion at multiple levels improves performance, and our framework is effective in determining profitable transformations.
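As a rough illustration of why loop fusion matters for the memory footprint of chained tensor contractions, the sketch below evaluates a two-step contraction with and without fusing the shared outer loop. It is not the framework described above: the function names `contract_unfused` and `contract_fused`, the specific expression, and the tiny input sizes are all assumptions chosen for the example, and plain Python lists stand in for the memory levels of a real machine.

```python
# Illustrative only: the names and the two-step expression are assumptions,
# not taken from the paper.

def contract_unfused(A, B, C):
    """D[i][l] = sum_k (sum_j A[i][j] * B[j][k]) * C[k][l], with a full intermediate T."""
    ni, nj, nk, nl = len(A), len(B), len(C), len(C[0])
    # The whole intermediate T (ni x nk) is materialized before it is consumed.
    T = [[sum(A[i][j] * B[j][k] for j in range(nj)) for k in range(nk)]
         for i in range(ni)]
    D = [[sum(T[i][k] * C[k][l] for k in range(nk)) for l in range(nl)]
         for i in range(ni)]
    return D, ni * nk  # result and number of intermediate elements stored


def contract_fused(A, B, C):
    """Same result, with the producer and consumer loops fused over i."""
    ni, nj, nk, nl = len(A), len(B), len(C), len(C[0])
    D = []
    for i in range(ni):
        # Fusing the shared i loop shrinks the intermediate to one row of nk elements.
        t_row = [sum(A[i][j] * B[j][k] for j in range(nj)) for k in range(nk)]
        D.append([sum(t_row[k] * C[k][l] for k in range(nk)) for l in range(nl)])
    return D, nk


A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]     # 2 x 3
C = [[1, 1], [2, 0], [0, 1]]   # 3 x 2

D_unfused, size_unfused = contract_unfused(A, B, C)
D_fused, size_fused = contract_fused(A, B, C)
```

Both versions compute the same result, but the fused version keeps only one row of the intermediate live at a time. On a deeper hierarchy the same choice recurs at each level, which is why the fusion structure, tile sizes, and placement of data movement interact.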


Keywords: Data Movement, Local Memory, Loop Structure, Global Memory, Loop Nest



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Wenjing Ma (1)
  • Sriram Krishnamoorthy (2)
  • Gagan Agrawal (1)

  1. The Ohio State University, Columbus, USA
  2. Pacific Northwest National Lab, Richland, USA
