Abstract
Loop tiling is a fundamental optimization for improving data locality. Selecting the right tile size and combining it with loop parallelization can provide additional performance gains on modern Chip MultiProcessor (CMP) architectures. This paper presents a runtime optimization system that automatically parallelizes loops and searches empirically for the best tile sizes on a scalable multi-cluster CMP. The system is built on top of a virtual machine and targets the runtime parallelization and optimization of Java programs. Experimental results show that runtime parallelization and tile size searching improve performance for two BLAS kernels and one Lattice-Boltzmann simulation, despite the runtime overheads.
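To make the idea of empirical tile size search concrete, the following is a minimal sequential sketch in Java. It is not the system described in the paper: the class name `TileSearch`, the methods `tiledMultiply` and `searchTileSize`, the matrix-multiply kernel, and the candidate tile sizes are all assumptions chosen for illustration, and the sketch omits the runtime parallelization and virtual machine integration that the paper targets. It simply times a tiled kernel for each candidate tile size and keeps the fastest.

```java
import java.util.Arrays;

public class TileSearch {
    // Computes C += A * B for n x n row-major matrices, tiling all three loops
    // by `tile` so that each block of work touches a small, reusable region.
    static void tiledMultiply(double[] a, double[] b, double[] c, int n, int tile) {
        for (int ii = 0; ii < n; ii += tile) {
            int iMax = Math.min(ii + tile, n);
            for (int kk = 0; kk < n; kk += tile) {
                int kMax = Math.min(kk + tile, n);
                for (int jj = 0; jj < n; jj += tile) {
                    int jMax = Math.min(jj + tile, n);
                    // Intra-tile loops: bounds are clamped so n need not be
                    // a multiple of the tile size.
                    for (int i = ii; i < iMax; i++) {
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < jMax; j++) {
                                c[i * n + j] += aik * b[k * n + j];
                            }
                        }
                    }
                }
            }
        }
    }

    // Hypothetical empirical search: run the kernel once per candidate tile
    // size and return the candidate with the shortest measured time.
    static int searchTileSize(double[] a, double[] b, int n, int[] candidates) {
        int best = candidates[0];
        long bestTime = Long.MAX_VALUE;
        for (int tile : candidates) {
            double[] c = new double[n * n];
            long start = System.nanoTime();
            tiledMultiply(a, b, c, n, tile);
            long elapsed = System.nanoTime() - start;
            if (elapsed < bestTime) {
                bestTime = elapsed;
                best = tile;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int n = 256;
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 2.0);
        int best = searchTileSize(a, b, n, new int[] {8, 16, 32, 64, 128});
        System.out.println("Fastest tile size in this run: " + best);
    }
}
```

A single timed run per candidate is noisy, especially under a JIT compiler, so the reported tile size can vary between runs. A more careful search would warm up the kernel and repeat each measurement, which is one source of the overheads a runtime system has to amortize.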
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhao, J., Horsnell, M., Luján, M., Rogers, I., Kirkham, C., Watson, I. (2008). Adaptive Loop Tiling for a Multi-cluster CMP. In: Bourgeois, A.G., Zheng, S.Q. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2008. Lecture Notes in Computer Science, vol 5022. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69501-1_23
DOI: https://doi.org/10.1007/978-3-540-69501-1_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69500-4
Online ISBN: 978-3-540-69501-1
eBook Packages: Computer Science, Computer Science (R0)