An Efficient Semi-Hierarchical Array Layout
Abstract
For high-level programming languages, linear array layouts have been the de facto sole form of mapping array elements to memory to see widespread use. The increasingly deep and complex memory hierarchies in current computer systems expose several deficiencies of linear array layouts. First, linear array layouts strongly favor locality in one index dimension of multidimensional arrays. Second, the exact mapping of array elements to cache locations depends on the array's size, which effectively renders linear array layouts non-analyzable with respect to cache behavior. We present and evaluate an alternative, semi-hierarchical array layout that differs from linear array layouts by being neutral with respect to locality in different index dimensions and by enabling accurate and precise analysis of cache behavior at compile time.
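The two deficiencies of linear layouts named above can be illustrated with a small sketch. It is not the semi-hierarchical layout proposed in the paper; it only models a conventional row-major layout. The function names, the 8-byte element size, and the direct-mapped cache with 64 sets of 64-byte lines are all assumptions chosen for illustration.

```python
# Illustrative model of a row-major (linear) layout and a simple cache.
# All names and parameters here are assumptions, not taken from the paper.

def row_major_offset(i, j, n_cols):
    """Element offset of A[i][j] in a row-major array with n_cols columns."""
    return i * n_cols + j

def cache_set(byte_addr, line_bytes=64, n_sets=64):
    """Set index of a byte address in an assumed direct-mapped cache."""
    return (byte_addr // line_bytes) % n_sets

def sets_touched_by_column(n_cols, n_rows=64, elem_bytes=8):
    """Distinct cache sets touched when walking down column 0."""
    return {
        cache_set(row_major_offset(i, 0, n_cols) * elem_bytes)
        for i in range(n_rows)
    }

# Locality favors one dimension: a step in j moves 1 element,
# a step in i moves n_cols elements.
step_j = row_major_offset(0, 1, 512) - row_major_offset(0, 0, 512)
step_i = row_major_offset(1, 0, 512) - row_major_offset(0, 0, 512)

# Cache mapping depends on array size: the same column walk behaves
# very differently for 512 and 520 columns.
sets_512 = sets_touched_by_column(512)
sets_520 = sets_touched_by_column(520)
```

Under these assumptions, a row of 512 elements spans exactly 64 cache lines, so every element of a column falls into the same set and successive accesses evict one another. With 520 columns the row spans 65 lines and the same walk spreads over all 64 sets. A small change in array size therefore changes the cache behavior completely, which is why such layouts are hard to analyze at compile time.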
Keywords
Array Element, Iteration Space, Cache Line, Index Expression, Array Reference