Adaptive Loop Tiling for a Multi-cluster CMP

Zhao, Jisheng; Horsnell, Matthew; Luján, Mikel; Rogers, Ian; Kirkham, Chris; Watson, Ian

doi:10.1007/978-3-540-69501-1_23


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5022)

Abstract

Loop tiling is a fundamental optimization for improving data locality. Selecting the right tile size, combined with the parallelization of loops, can provide additional performance gains on modern Chip MultiProcessor (CMP) architectures. This paper presents a runtime optimization system that automatically parallelizes loops and searches empirically for the best tile sizes on a scalable multi-cluster CMP. The system is built on top of a virtual machine and targets the runtime parallelization and optimization of Java programs. Experimental results show that runtime parallelization and tile size searching are capable of improving performance for two BLAS kernels and one Lattice-Boltzmann simulation, despite the overheads.
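
As a rough illustration of the two ideas the abstract combines, the sketch below tiles a dense matrix multiply, runs the row tiles in parallel, and then times a handful of candidate tile sizes to pick the fastest. It is not the paper's system: the class name `TileSearchSketch`, its methods, the candidate list, and the use of Java parallel streams in place of the VM-level parallelization on a multi-cluster CMP are all assumptions made for this example.

```java
import java.util.stream.IntStream;

// Hypothetical illustration only; names and structure are not from the paper.
public class TileSearchSketch {

    // Untiled reference: C = A * B for n x n matrices stored row-major.
    static double[] multiplyNaive(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }

    // Tiled multiply. Each row tile writes a disjoint set of rows of C,
    // so the row tiles can run in parallel without synchronization.
    static double[] multiplyTiled(double[] a, double[] b, int n, int tile) {
        double[] c = new double[n * n];
        int rowTiles = (n + tile - 1) / tile;
        IntStream.range(0, rowTiles).parallel().forEach(t -> {
            int ii = t * tile;
            int iMax = Math.min(ii + tile, n);
            for (int kk = 0; kk < n; kk += tile) {
                int kMax = Math.min(kk + tile, n);
                for (int jj = 0; jj < n; jj += tile) {
                    int jMax = Math.min(jj + tile, n);
                    for (int i = ii; i < iMax; i++) {
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < jMax; j++) {
                                c[i * n + j] += aik * b[k * n + j];
                            }
                        }
                    }
                }
            }
        });
        return c;
    }

    // Empirical search: run each candidate once and keep the fastest.
    // A single timing is noisy; a real search would repeat and warm up.
    static int searchBestTile(double[] a, double[] b, int n, int[] candidates) {
        int best = candidates[0];
        long bestTime = Long.MAX_VALUE;
        for (int tile : candidates) {
            long start = System.nanoTime();
            multiplyTiled(a, b, n, tile);
            long elapsed = System.nanoTime() - start;
            if (elapsed < bestTime) {
                bestTime = elapsed;
                best = tile;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int n = 256;
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        for (int i = 0; i < n * n; i++) {
            a[i] = i % 13;
            b[i] = i % 7;
        }
        int[] candidates = {8, 16, 32, 64, 128}; // assumed search space
        System.out.println("Fastest tile size in this run: "
                + searchBestTile(a, b, n, candidates));
    }
}
```

The tile size chosen varies from machine to machine and from run to run, which is the motivation for searching at runtime rather than fixing a size ahead of time.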



Author information

Authors and Affiliations

University of Manchester, UK
Jisheng Zhao, Matthew Horsnell, Mikel Luján, Ian Rogers, Chris Kirkham & Ian Watson


Editor information

Anu G. Bourgeois, S. Q. Zheng


About this paper

Cite this paper

Zhao, J., Horsnell, M., Luján, M., Rogers, I., Kirkham, C., Watson, I. (2008). Adaptive Loop Tiling for a Multi-cluster CMP. In: Bourgeois, A.G., Zheng, S.Q. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2008. Lecture Notes in Computer Science, vol 5022. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69501-1_23


DOI: https://doi.org/10.1007/978-3-540-69501-1_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69500-4
Online ISBN: 978-3-540-69501-1
eBook Packages: Computer Science, Computer Science (R0)
