Abstract
In this paper, we introduce a novel approach to guide tile size selection by employing analytical models to limit empirical search within a subspace of the full search space. Two analytical models are used together: 1) an existing conservative model, based on the data footprint of a tile, which ignores intra-tile cache block replacement, and 2) an aggressive new model that assumes optimal cache block replacement within a tile. Experimental results on multiple platforms demonstrate the practical effectiveness of the approach by reducing the search space for the optimal tile size by 1,307× to 11,879× for an Intel Core-2-Quad system; 358× to 1,978× for an Intel Nehalem system; and 45× to 1,142× for an IBM Power7 system. The execution of rectangularly tiled code tuned by a search of the subspace identified by our model achieves speed-ups of up to 1.40× (Intel Core-2 Quad), 1.28× (Nehalem) and 1.19× (Power 7) relative to the best possible square tile sizes on these different processor architectures. We also demonstrate the integration of the analytical bounds with existing search optimization algorithms. Our approach not only reduces the total search time from Nelder-Mead Simplex and Parallel Rank Ordering methods by factors of up to 4.95× and 4.33×, respectively, but also finds better tile sizes that yield higher performance in tuned tiled code.
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Barr, T.W., Cox, A.L., Rixner, S.: Translation caching: skip, don’t walk (the page table). In: ISCA 2010, pp. 48–59. ACM, New York (2010)
Baskaran, M., Hartono, A., Tavarageri, S., Henretty, T., Ramanujam, J., Sadayappan, P.: Parameterized tiling revisited. In: CGO, pp. 200–209 (2010)
Bhargava, R., Serebrin, B., Spadini, F., Manne, S.: Accelerating two-dimensional page walks for virtualized systems. In: ASPLOS XIII, pp. 26–35 (2008)
Bilmes, J., Asanovic, K., Chin, C., Demmel, J.: Optimizing matrix multiply using PHiPAC. In: Proc. ICS, pp. 340–347 (1997)
Bodin, F., Jalby, W., Windheiser, D., Eisenbeis, C.: A quantitative algorithm for data locality optimization. In: Code Generation, pp. 119–145 (1991)
Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: A practical automatic polyhedral program optimization system. In: PLDI (2008)
Boulet, P., Darte, A., Risset, T., Robert, Y. (Pen)-ultimate tiling? Integration, the VLSI Journal 17(1), 33–51 (1994)
Chame, J., Moon, S.: A tile selection algorithm for data locality and cache interference. In: ICS, pp. 492–499 (1999)
Chen, C., Chame, J., Hall, M.: Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In: CGO 2005 (2005)
Coleman, S., McKinley, K.: Tile Size Selection Using Cache Organization and Data Layout. In: PLDI, pp. 279–290 (1995)
Datta, K.: Auto-tuning stencil codes for cache-based multicore platforms. Technical report, University of California, Berkeley (December 2009)
Ferrante, J., Sarkar, V., Thrash, W.: On Estimating and Enhancing Cache Effectiveness. In: Banerjee, U., Nicolau, A., Gelernter, D., Padua, D.A. (eds.) LCPC 1991. LNCS, vol. 589, pp. 328–343. Springer, Heidelberg (1992)
Ghosh, S., Martonosi, M., Malik, S.: Cache miss equations: a compiler framework for analyzing and tuning memory behavior. ACM TOPLAS 21(4), 703–746 (1999)
Goto, K., van de Geijn, R.A.: High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw. 35(1) (July 2008)
Hartono, A., Baskaran, M.M., Bastoul, C., Cohen, A., Krishnamoorthy, S., Norris, B., Ramanujam, J., Sadayappan, P.: Parametric multi-level tiling of imperfectly nested loops. In: Proc. ICS (2009)
Hsu, C., Kremer, U.: A quantitative analysis of tile size selection algorithms. J. Supercomput. 27(3), 279–294 (2004)
Irigoin, F., Triolet, R.: Supernode partitioning. In: ACM POPL, pp. 319–329 (1988)
Kim, D., Renganarayanan, L., Strout, M., Rajopadhye, S.: Multi-level tiling: ’m’ for the price of one. In: SC (2007)
Knijnenburg, P.M.W., Kisuki, T., O’Boyle, M.F.P.: Combined selection of tile sizes and unroll factors using iterative compilation. The Journal of Supercomputing 24(1), 43–67 (2003)
Lam, M., Rothberg, E., Wolf, M.: The cache performance and optimizations of blocked algorithms. In: Proc. 4th ACM ASPLOS, pp. 63–74 (1991)
Luersen, M., Riche, R.L., Guyon, F.: A constrained, globalized, and bounded nelder-mead method for engineering optimization. Structural and Multidisciplinary Optimization 27(1-2), 43–54 (2004)
Nelder, J.A., Mead, R.: A simplex method for function minimization. Computer Journal 7(4), 308–313 (1965)
Ramanujam, J., Sadayappan, P.: Tiling multidimensional iteration spaces for multicomputers. JPDC 16(2), 108–230 (1992)
Renganarayana, L., Kim, D., Rajopadhye, S., Strout, M.: Parameterized tiled loops for free. In: PLDI, pp. 405–414 (2007)
Resource Characterization in the PACE Project, http://www.pace.rice.edu/Content.aspx?id=41
Rivera, G., Tseng, C.: Locality optimizations for multi-level caches. In: SC (1999)
Sarkar, V.: Automatic Selection of High Order Transformations in the IBM XL Fortran Compilers. IBM J. Res. & Dev. 41(3) (May 1997)
Sarkar, V., Megiddo, N.: An analytical model for loop tiling and its solution. In: IEEE ISPASS (2000)
Schreiber, R., Dongarra, J.: Automatic blocking of nested loops. Tech. Report 90.38, RIACS, NASA Ames Research Center (1990)
Tabatabaee, V., Tiwari, A., Hollingsworth, J.K.: Parallel parameter tuning for applications with performance variability. In: Proc. Supercomputing 2005 (2005)
Tapus, C., Chung, I.-H., Hollingsworth, J.K.: Active harmony: towards automated performance tuning. In: SC, pp. 1–11 (2002)
Tiwari, A., Chen, C., Chame, J., Hall, M., Hollingsworth, J.: Scalable autotuning framework for compiler optimization. In: IPDPS 2009 (2009)
Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimization of software and the ATLAS project. Parallel Computing 27(1–2), 3–35 (2001)
Wolf, M., Lam, M.S.: A data locality optimizing algorithm. In: PLDI 1991, pp. 30–44 (1991)
Wolfe, M.: More iteration space tiling. In: Proc. Supercomputing, pp. 655–664 (1989)
Xue, J.: Loop tiling for parallelism. Kluwer Academic Publishers, Norwell (2000)
Yotov, K., Pingali, K., Stodghill, P.: Think globally, search locally. In: International Conference on Supercomputing (2005)
Yuki, T., Renganarayanan, L., Rajopadhye, S., Anderson, C., Eichenberger, A., O’Brien, K.: Automatic creation of tile size selection models. In: CGO, pp. 190–199 (2010)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shirako, J. et al. (2012). Analytical Bounds for Optimal Tile Size Selection. In: O’Boyle, M. (eds) Compiler Construction. CC 2012. Lecture Notes in Computer Science, vol 7210. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28652-0_6
Download citation
DOI: https://doi.org/10.1007/978-3-642-28652-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28651-3
Online ISBN: 978-3-642-28652-0
eBook Packages: Computer ScienceComputer Science (R0)