Abstract
Classic loop unrolling increases the performance of sequential loops by reducing the overhead of the non-computational parts of the loop. Unfortunately, when the loop body contains parallelism, most compilers either ignore it or perform a naïve transformation.
We propose extending the semantics of the loop unrolling transformation to cover loops that contain task parallelism. In these cases, the transformation aggregates the multiple tasks that appear after a classic unrolling phase, reducing the per-iteration overhead.
We present an implementation of this extended loop unrolling for OpenMP tasks with two phases: a classic unroll followed by a task-aggregation phase. Our aggregation technique also covers the special cases where task parallelism appears inside branches or where the loop is uncountable (its trip count is unknown at compile time).
Our experimental results show that this extended unrolling allows loops with fine-grained tasks to reduce the overhead associated with task creation and achieve much better scaling.
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
Ferrer, R., Duran, A., Martorell, X., Ayguadé, E. (2010). Unrolling Loops Containing Task Parallelism. In: Gao, G.R., Pollock, L.L., Cavazos, J., Li, X. (eds) Languages and Compilers for Parallel Computing. LCPC 2009. Lecture Notes in Computer Science, vol 5898. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13374-9_30
Print ISBN: 978-3-642-13373-2
Online ISBN: 978-3-642-13374-9