Abstract
Whereas efficient barrier implementations were once a concern only in high-performance computing, recent trends in core integration make the topic relevant even for general-purpose chip multiprocessors (CMPs). While the nature of CMP applications requires low latency, the cost of low-latency barrier implementations using hardware-based techniques can be prohibitive for CMPs, where die area represents opportunities for throughput and yield. Similarly, whereas traditional multiprocessor barrier implementations were developed primarily for dedicated environments, scheduling and multi-programming on CMPs require more adaptable barrier implementations.
In this paper, we present and evaluate three barrier implementations that are hybrids of software and dedicated hardware barriers and are specifically tailored for CMPs. The implementations leverage the unique characteristics of CMPs and provide low latency comparable to that of dedicated hardware networks at a fraction of the cost. The implementations also support adaptability, enabling efficient multi-programming and dynamic remapping of the barrier network.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sartori, J., Kumar, R. (2010). Low-Overhead, High-Speed Multi-core Barrier Synchronization. In: Patt, Y.N., Foglia, P., Duesterwald, E., Faraboschi, P., Martorell, X. (eds) High Performance Embedded Architectures and Compilers. HiPEAC 2010. Lecture Notes in Computer Science, vol 5952. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11515-8_4
DOI: https://doi.org/10.1007/978-3-642-11515-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11514-1
Online ISBN: 978-3-642-11515-8
eBook Packages: Computer Science, Computer Science (R0)