Abstract
Tighter integration on chip multiprocessors puts greater pressure on off-chip accesses to the memory system, which makes minimizing the number of off-chip accesses a critical optimization goal. This paper discusses a compiler-based solution to this problem for embedded applications that perform stencil computations. An important characteristic of this solution is that it distinguishes between intra-processor data reuse and inter-processor data reuse. The former captures the data reuse that occurs across loop iterations assigned to the same processor, whereas the latter represents the data reuse that takes place across loop iterations assigned to different processors. The proposed approach optimizes inter-processor reuse by carefully reorganizing the loop iterations of each processor, taking into account how data elements are shared across processors. The goal is to ensure that different processors access shared data within a short period of time, so that the data is still in the on-chip memory space at the time of the reuse. This paper also presents an evaluation of the proposed optimization and compares it to an alternative scheme that optimizes data locality for each processor in isolation. Results obtained by applying our implementation to eight loop-intensive benchmark codes from the embedded computing domain show that our approach improves over this alternative scheme by 15.6% on average.
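The idea of bringing accesses to shared data close together in time can be illustrated with a small sketch. It is not the algorithm from the paper. It is a toy model under stated assumptions: a 3-point, one-dimensional stencil in which iteration i touches elements i-1, i, and i+1; a block partition of iterations across processors; processors that advance in lockstep, one iteration per time step; and no loop-carried dependences (for example, a Jacobi-style update that reads one array and writes another), so a block may legally be traversed backwards. The names `schedules` and `max_sharing_gap` are hypothetical.

```python
from collections import defaultdict


def schedules(n_iters, n_procs, alternate):
    """Return one iteration order per processor for a block partition.

    If `alternate` is True, even-numbered processors walk their block
    backwards, so neighbouring processors meet at the shared boundary.
    Assumes n_iters is a multiple of n_procs.
    """
    block = n_iters // n_procs
    orders = []
    for p in range(n_procs):
        its = list(range(p * block, (p + 1) * block))
        if alternate and p % 2 == 0:
            its.reverse()
        orders.append(its)
    return orders


def max_sharing_gap(orders):
    """Worst-case time distance between two processors touching shared data.

    For every element touched by more than one processor, take the smallest
    gap (in time steps) between accesses by each pair of processors, then
    return the largest such gap over all shared elements.
    """
    touched = defaultdict(dict)  # element -> {processor: [time steps]}
    for p, its in enumerate(orders):
        for step, i in enumerate(its):
            for elem in (i - 1, i, i + 1):  # 3-point stencil footprint
                touched[elem].setdefault(p, []).append(step)

    worst = 0
    for by_proc in touched.values():
        procs = sorted(by_proc)
        for idx, a in enumerate(procs):
            for b in procs[idx + 1:]:
                gap = min(abs(s - t) for s in by_proc[a] for t in by_proc[b])
                worst = max(worst, gap)
    return worst


baseline = max_sharing_gap(schedules(64, 4, alternate=False))
reordered = max_sharing_gap(schedules(64, 4, alternate=True))
print(baseline, reordered)  # prints "14 0"
```

In this toy model each processor still visits exactly the same iterations, so its own (intra-processor) reuse is unchanged; only the order changes, and with it the time window in which a boundary element must stay on chip to be reused by the neighbouring processor.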
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, G., Kandemir, M. (2007). An Approach for Enhancing Inter-processor Data Locality on Chip Multiprocessors. In: Stenström, P. (ed.) Transactions on High-Performance Embedded Architectures and Compilers I. Lecture Notes in Computer Science, vol 4050. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71528-3_14
DOI: https://doi.org/10.1007/978-3-540-71528-3_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71527-6
Online ISBN: 978-3-540-71528-3
eBook Packages: Computer Science (R0)