CaCAO: Complex and Compositional Atomic Operations for NoC-Based Manycore Platforms
Tile-based distributed memory systems have increased the scalability of manycore platforms. However, inter-tile memory accesses, especially those for thread synchronization, suffer from high remote access latencies. Our thorough investigation of lock-based and lock-free synchronization primitives shows that there is a concurrency-dependent cross-over point between them, i.e., there is no one-size-fits-all solution. Therefore, we propose to combine the conceptual advantages of both variants (no retries and lock-freedom) by using dedicated hardware support for inter-tile atomic operations. For frequently used and highly concurrent data structures, we show speedup factors of 23.9 and 35.4 over the lock-based and lock-free implementations, respectively, which increase with higher concurrency.
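To make the two software baselines concrete, the following is a minimal sketch of a shared counter updated once under a lock and once with a compare-and-swap retry loop. It is an illustration on an ordinary shared-memory host using standard C++ atomics, not the hardware mechanism proposed in the paper; the names `LockedCounter`, `CasCounter`, and `run` are hypothetical and chosen only for this example.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Lock-based variant (hypothetical name): every update acquires a mutex.
// An update never has to be retried, but waiting threads are blocked.
struct LockedCounter {
    std::mutex m;
    std::uint64_t value = 0;

    void add(std::uint64_t d) {
        std::lock_guard<std::mutex> guard(m);
        value += d;
    }
    std::uint64_t load() {
        std::lock_guard<std::mutex> guard(m);
        return value;
    }
};

// Lock-free variant (hypothetical name): a compare-and-swap loop.
// No thread ever blocks, but a failed CAS forces a retry under contention.
struct CasCounter {
    std::atomic<std::uint64_t> value{0};

    void add(std::uint64_t d) {
        std::uint64_t expected = value.load();
        // On failure, compare_exchange_weak writes the current value
        // into `expected`, so the next attempt uses fresh data.
        while (!value.compare_exchange_weak(expected, expected + d)) {
        }
    }
    std::uint64_t load() { return value.load(); }
};

// Helper (assumption for this sketch): `threads` workers each add 1
// to a fresh counter `iters` times; returns the final counter value.
template <class Counter>
std::uint64_t run(unsigned threads, unsigned iters) {
    Counter c;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&c, iters] {
            for (unsigned i = 0; i < iters; ++i) c.add(1);
        });
    }
    for (auto& th : pool) th.join();
    return c.load();
}
```

Calling `run<LockedCounter>(n, k)` and `run<CasCounter>(n, k)` with growing `n` exercises the two costs the abstract contrasts: time spent waiting for the lock versus time spent on failed compare-and-swap attempts. A single atomic update executed at the remote memory would, in principle, need neither.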
Keywords: Atomic operations, Remote synchronization, Compare-and-swap, Distributed shared memory, Network-on-Chip
This work was partly supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Center Invasive Computing [SFB/TR 89]. The authors would also like to thank Christoph Erhardt, Sebastian Maier and Florian Schmaus from FAU Erlangen, as well as Dirk Gabriel from our chair for the helpful discussions.