Abstract
Barrier synchronization is an important, performance-critical primitive in many parallel programming models, including the popular OpenMP model. In this paper, we compare the performance of several software implementations of barrier synchronization and introduce a new implementation, distributed counters with local sensor, which considerably reduces overhead on POWER3 and POWER4 SMP systems. In experiments with the EPCC OpenMP benchmark, we demonstrate a 79% reduction in overhead on a 32-way POWER4 system and an 87% reduction in overhead on a 16-way POWER3 system, compared with a fetch-and-add implementation. Because these improvements are primarily attributable to fewer L2 and L3 cache misses, we expect the relative advantage of our implementation to grow with the number of processors in an SMP and as memory latencies lengthen relative to cache latencies.
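To make the contrast in the abstract concrete, the following is a minimal sketch of the two barrier styles it names. It is not the authors' code, only one plausible reading of the terms. The fetch-and-add baseline has every thread update one shared counter. The distributed variant gives each thread its own arrival counter and its own release flag (the "local sensor"), each on a separate cache line, so a waiting thread spins only on a line that one other thread writes. The class names, the `count_violations` helper, the 128-byte line size, and the use of thread 0 as the collecting thread are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size (128 bytes); adjust for the target machine.
constexpr std::size_t kLine = 128;

// One atomic integer alone on its (assumed) cache line.
// Over-aligned allocation inside std::vector needs C++17.
struct alignas(kLine) PaddedInt {
    std::atomic<int> v{0};
};

// Baseline: one central fetch-and-add counter plus a shared generation word.
class FetchAddBarrier {
public:
    explicit FetchAddBarrier(int n) : n_(n) { assert(n > 0); }
    void wait(int /*tid*/) {
        const int gen = gen_.load(std::memory_order_acquire);
        if (count_.fetch_add(1, std::memory_order_acq_rel) == n_ - 1) {
            count_.store(0, std::memory_order_relaxed);      // last arrival resets
            gen_.store(gen + 1, std::memory_order_release);  // and releases everyone
        } else {
            while (gen_.load(std::memory_order_acquire) == gen)
                std::this_thread::yield();  // demo only; a busy-wait barrier just spins
        }
    }
private:
    const int n_;
    std::atomic<int> count_{0};
    std::atomic<int> gen_{0};
};

// Sketch: per-thread arrival counters and per-thread release flags.
class DistributedBarrier {
public:
    explicit DistributedBarrier(int n) : n_(n), counter_(n), sensor_(n) {
        assert(n > 0);
    }
    void wait(int tid) {
        // Value this thread's sensor will hold once the episode completes.
        const int next = sensor_[tid].v.load(std::memory_order_relaxed) + 1;
        counter_[tid].v.store(1, std::memory_order_release);  // announce arrival
        if (tid == 0) {
            // Thread 0 collects every arrival, resetting counters as it goes.
            for (int t = 0; t < n_; ++t) {
                while (counter_[t].v.load(std::memory_order_acquire) == 0)
                    std::this_thread::yield();  // demo only
                counter_[t].v.store(0, std::memory_order_relaxed);
            }
            // Then it flips each thread's own sensor to release it.
            for (int t = 0; t < n_; ++t)
                sensor_[t].v.store(next, std::memory_order_release);
        } else {
            while (sensor_[tid].v.load(std::memory_order_acquire) != next)
                std::this_thread::yield();  // demo only
        }
    }
private:
    const int n_;
    std::vector<PaddedInt> counter_;  // written by owner, reset by thread 0
    std::vector<PaddedInt> sensor_;   // written by thread 0, polled by owner
};

// Hypothetical check: after each barrier, every thread must have reached
// at least the current round. Returns the number of violations observed.
template <class Barrier>
int count_violations(int n, int rounds) {
    Barrier barrier(n);
    std::vector<PaddedInt> phase(n);
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int tid = 0; tid < n; ++tid) {
        pool.emplace_back([&, tid] {
            for (int r = 1; r <= rounds; ++r) {
                phase[tid].v.store(r, std::memory_order_release);
                barrier.wait(tid);
                for (int t = 0; t < n; ++t)
                    if (phase[t].v.load(std::memory_order_acquire) < r)
                        violations.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return violations.load();
}
```

Calling `count_violations<DistributedBarrier>(4, 200)` or `count_violations<FetchAddBarrier>(4, 200)` exercises either barrier with four threads. The sketch checks correctness only; it makes no attempt to reproduce the overhead figures quoted above, which depend on the POWER3 and POWER4 cache hierarchies.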
Ph.D. candidate, research student visiting from the Technical University of Catalonia (UPC), Barcelona, Spain.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, G., Martínez, F., Tal, A., Blainey, B. (2003). Busy-Wait Barrier Synchronization Using Distributed Counters with Local Sensor. In: Voss, M.J. (eds) OpenMP Shared Memory Parallel Programming. WOMPAT 2003. Lecture Notes in Computer Science, vol 2716. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45009-2_7
DOI: https://doi.org/10.1007/3-540-45009-2_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40435-4
Online ISBN: 978-3-540-45009-2
eBook Packages: Springer Book Archive