Balanced, Locality-Based Parallel Irregular Reductions
Much recent effort has been devoted to parallelizing irregular reductions efficiently. The parallelization techniques proposed in recent years can be classified into two groups: LPO (Loop Partitioning Oriented) methods and DPO (Data Partitioning Oriented) methods. We have analyzed both classes in terms of a set of performance aspects: data locality, memory overhead, parallelism and workload balancing. Load balancing has not been sufficiently analyzed in the literature on parallel reduction methods, especially for those in the DPO class. In this paper we propose two techniques to introduce load balancing into a DPO method. The first technique is generic, as it can deal with any kind of load imbalance present in the problem domain. The second technique handles a special case of load imbalance, which appears when a large number of write operations fall on small regions of the reduction arrays. Efficient implementations of the proposed load-balancing solutions are presented for an example DPO method. Experiments on static and dynamic kernel codes were conducted to compare them with other parallel reduction methods.
Keywords: Loop Iteration, Execution Phase, Memory Overhead, Workload Balance, Parallel Thread
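To make the LPO/DPO distinction concrete, the following is a minimal sketch that simulates both classes sequentially on a reduction of the form `a[idx[k]] += vals[k]`. It is not the method proposed in the paper: the function names, the contiguous block partitioning, and the per-thread `work` counter are all assumptions made for illustration. The LPO variant splits the iterations and pays for private array copies (memory overhead), while the DPO variant splits the reduction array and lets each owner execute only the iterations that write into its block, which exposes load imbalance when writes concentrate on a small region.

```python
# Illustrative sketch only: a sequential simulation of the two classes of
# parallel irregular reduction. All names here are hypothetical.

def serial_reduction(n, idx, vals):
    """Reference result: a[idx[k]] += vals[k] for every iteration k."""
    a = [0] * n
    for i, v in zip(idx, vals):
        a[i] += v
    return a


def lpo_reduction(n, idx, vals, threads):
    """Loop partitioning: contiguous blocks of iterations per simulated
    thread, each writing to a private copy that is combined at the end."""
    n_iter = len(idx)
    chunk = (n_iter + threads - 1) // threads
    private = [[0] * n for _ in range(threads)]  # threads * n extra memory
    for t in range(threads):
        for k in range(t * chunk, min((t + 1) * chunk, n_iter)):
            private[t][idx[k]] += vals[k]
    # Combine step: element-wise sum of the private copies.
    return [sum(column) for column in zip(*private)]


def dpo_reduction(n, idx, vals, threads):
    """Data partitioning: contiguous blocks of the reduction array per
    simulated thread; a thread runs only iterations writing to its block.
    Returns the result and the number of iterations each thread executed."""
    block = (n + threads - 1) // threads
    a = [0] * n
    work = [0] * threads
    for t in range(threads):
        lo, hi = t * block, min((t + 1) * block, n)
        for i, v in zip(idx, vals):
            if lo <= i < hi:  # owner-computes test, no write conflicts
                a[i] += v
                work[t] += 1
    return a, work
```

For example, with `n = 8`, `idx = [0, 1, 0, 0, 5, 1, 0, 7]` and four simulated threads, `dpo_reduction` assigns six of the eight iterations to the first thread, because most writes target the first two array elements. This is the kind of imbalance that the second technique in the abstract is concerned with; a real implementation would also avoid rescanning every iteration per thread, for instance by pre-sorting iterations by owner.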