Busy-Wait Barrier Synchronization Using Distributed Counters with Local Sensor

  • Guansong Zhang
  • Francisco Martínez
  • Arie Tal
  • Bob Blainey
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2716)


Barrier synchronization is an important and performance-critical primitive in many parallel programming models, including the popular OpenMP model. In this paper, we compare the performance of several software implementations of barrier synchronization and introduce a new implementation, distributed counters with local sensor, which considerably reduces overhead on POWER3 and POWER4 SMP systems. Through experiments with the EPCC OpenMP benchmark, we demonstrate a 79% reduction in overhead on a 32-way POWER4 system and an 87% reduction in overhead on a 16-way POWER3 system, compared with a fetch-and-add implementation. Because these improvements are attributed primarily to reduced L2 and L3 cache misses, we expect the relative performance of our implementation to increase with the number of processors in an SMP and as memory latencies lengthen relative to cache latencies.
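
The abstract names two busy-wait designs without giving their details, so the following C++ sketch is only one plausible reading of them, not the authors' implementation. It assumes that "fetch-and-add" means a single shared counter that every thread increments, and that "distributed counters with local sensor" means each thread publishes its arrival in its own cache line and then spins on its own release flag, which a designated thread (here thread 0) writes after seeing every arrival. The class names, the episode-number scheme, the `count_violations` helper, and the 128-byte padding are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; padding keeps each flag on its own line (C++17 honors it in std::vector).
constexpr std::size_t kCacheLine = 128;

struct alignas(kCacheLine) PaddedFlag {
    std::atomic<unsigned> value{0};
};

// Baseline: all threads increment one shared counter and spin on one shared word.
class FetchAddBarrier {
public:
    explicit FetchAddBarrier(unsigned n) : n_(n) { assert(n > 0); }
    void wait(unsigned /*tid*/) {
        unsigned gen = generation_.load(std::memory_order_acquire);
        if (count_.fetch_add(1, std::memory_order_acq_rel) + 1 == n_) {
            count_.store(0, std::memory_order_relaxed);             // reset for the next episode
            generation_.store(gen + 1, std::memory_order_release);  // release the waiters
        } else {
            while (generation_.load(std::memory_order_acquire) == gen)
                std::this_thread::yield();
        }
    }
private:
    const unsigned n_;
    std::atomic<unsigned> count_{0};
    std::atomic<unsigned> generation_{0};
};

// Hypothetical reading of "distributed counters with local sensor":
// one arrival slot and one release flag (the "sensor") per thread, each padded.
class DistributedCounterBarrier {
public:
    explicit DistributedCounterBarrier(unsigned n) : arrived_(n), sensor_(n) { assert(n > 0); }
    void wait(unsigned tid) {
        // The sensor holds the last completed episode, so the next one is +1.
        unsigned episode = sensor_[tid].value.load(std::memory_order_relaxed) + 1;
        arrived_[tid].value.store(episode, std::memory_order_release);
        if (tid == 0) {
            for (auto& c : arrived_)   // thread 0 polls every arrival slot
                while (c.value.load(std::memory_order_acquire) != episode)
                    std::this_thread::yield();
            for (auto& s : sensor_)    // then writes each thread's private sensor
                s.value.store(episode, std::memory_order_release);
        } else {
            while (sensor_[tid].value.load(std::memory_order_acquire) != episode)
                std::this_thread::yield();  // spin only on this thread's own line
        }
    }
private:
    std::vector<PaddedFlag> arrived_;
    std::vector<PaddedFlag> sensor_;
};

// Illustrative check: counts how often a thread passes the barrier before
// every thread has published its value for the current round.
template <class Barrier>
unsigned count_violations(unsigned n_threads, unsigned rounds) {
    Barrier barrier(n_threads);
    std::vector<PaddedFlag> slot(n_threads);
    std::atomic<unsigned> violations{0};
    std::vector<std::thread> pool;
    for (unsigned tid = 0; tid < n_threads; ++tid) {
        pool.emplace_back([&, tid] {
            for (unsigned r = 1; r <= rounds; ++r) {
                slot[tid].value.store(r, std::memory_order_relaxed);
                barrier.wait(tid);
                for (auto& s : slot)
                    if (s.value.load(std::memory_order_relaxed) != r)
                        violations.fetch_add(1);
                barrier.wait(tid);  // keep fast threads from overwriting slots early
            }
        });
    }
    for (auto& t : pool) t.join();
    return violations.load();
}
```

Calling `count_violations<DistributedCounterBarrier>(4, 200)` runs four threads through 200 rounds. In the second design each waiting thread reads only its own padded sensor, so a release does not force every waiter to reload one contended line. That is consistent with the abstract's attribution of the gains to fewer L2 and L3 cache misses, although this sketch does not measure it.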


Keywords: barrier synchronization, multiprocessor, distributed counter





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Guansong Zhang (1)
  • Francisco Martínez (1)
  • Arie Tal (1)
  • Bob Blainey (1)

  1. IBM Toronto Lab, Toronto, Canada
