Extending hardware transactional memory capacity via rollback-only transactions and suspend/resume
Abstract
Transactional memory (TM) aims at simplifying concurrent programming via the familiar abstraction of atomic transactions. Recently, Intel and IBM have integrated hardware based TM (HTM) implementations in commodity processors, paving the way for the mainstream adoption of the TM paradigm. Yet, existing HTM implementations suffer from a crucial limitation, which hampers the adoption of HTM as a general technique for regulating concurrent access to shared memory: the inability to execute transactions whose working sets exceed the capacity of CPU caches. In this article we propose P8TM, a novel approach that mitigates this limitation on IBM’s POWER8 architecture by leveraging a key combination of hardware and software techniques to support different execution paths. P8TM also relies on self-tuning mechanisms aimed at dynamically switching between different execution modes to best adapt to the workload characteristics. In-depth evaluation with several benchmarks indicates that P8TM can achieve striking performance gains in workloads that stress the capacity limitations of HTM, while achieving performance on par with HTM even in unfavourable workloads.
Keywords
Hardware transactional memory Parallel programming Concurrency Self-tuningNotes
Acknowledgements
This work was supported by Portuguese funds through Fundação para a Ciência e Tecnologia via projects UID/CEC/50021/2019 and PTDC/EEISCR/1743/2014.
References
- 1.Afek, Y., Levy, A., Morrison, A.: Programming with hardware lock elision. In: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’13, pp. 295–296. ACM, New York (2013). https://doi.org/10.1145/2442516.2442552
- 2.Arbel, M., Attiya, H.: Concurrent updates with RCU: Search tree as an example. In: Proceedings of the 2014 ACM Symposium on Principles of Distributed Computing, PODC ’14, pp. 196–205. ACM, New York (2014). https://doi.org/10.1145/2611462.2611471
- 3.Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res. 3, 397–422 (2003)MathSciNetzbMATHGoogle Scholar
- 4.Berry, D., Fristedt, B.: Bandit Problems. Chapman and Hall, London (1985)CrossRefGoogle Scholar
- 5.Boehm, H., Gottschlich, J., Luchangco, V., Michael, M., Moir, M., Nelson, C., Riegel, T., Shpeisman, T., Wong, M.: Transactional language constructs for C++. ISO/IEC JTC1/SC22 WG21 (C++) (2012)Google Scholar
- 6.Cain, H.W., Michael, M.M., Frey, B., May, C., Williams, D., Le, H.: Robust architectural support for transactional memory in the power architecture. In: Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, pp. 225–236. ACM, New York (2013). https://doi.org/10.1145/2485922.2485942
- 7.Calciu, I., Shpeisman, T., Pokam, G., Herlihy, M.: Improved single global lock fallback for best-effort hardware transactional memory. In: 9th ACM SIGPLAN Wkshp. on Transactional Computing (2014)Google Scholar
- 8.Clements, A.T., Kaashoek, M.F., Zeldovich, N.: Scalable address spaces using RCU balanced trees. In: Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pp. 199–210. ACM, New York (2012). https://doi.org/10.1145/2150976.2150998
- 9.Dalessandro, L., Carouge, F., White, S., Lev, Y., Moir, M., Scott, M.L., Spear, M.F.: Hybrid NOrec: A case study in the effectiveness of best effort hardware transactional memory. In: Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pp. 39–52. ACM, New York (2011). https://doi.org/10.1145/1950365.1950373
- 10.Damron, P., Fedorova, A., Lev, Y., Luchangco, V., Moir, M., Nussbaum, D.: Hybrid transactional memory. In: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, pp. 336–346. ACM, New York (2006). https://doi.org/10.1145/1168857.1168900
- 11.Di Sanzo, P., Ciciani, B., Palmieri, R., Quaglia, F., Romano, P.: On the analytical modeling of concurrency control algorithms for software transactional memories: the case of commit-time-locking. Perform. Eval. 69(5), 187–205 (2012). https://doi.org/10.1016/j.peva.2011.05.002 CrossRefGoogle Scholar
- 12.di Sanzo, P., Quaglia, F., Palmieri, R.: Analytical modelling of commit-time-locking algorithms for software transactional memories. In: Int. CMG Conference (2010)Google Scholar
- 13.Dice, D., Harris, T.L., Kogan, A., Lev, Y., Moir, M.: Hardware extensions to make lazy subscription safe. CoRR arXiv:1407.6968 (2014)
- 14.Dice, D., Lev, Y., Moir, M., Nussbaum, D.: Early experience with a commercial hardware transactional memory implementation. In: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIV, pp. 157–168. ACM, New York (2009). https://doi.org/10.1145/1508244.1508263
- 15.Dice, D., Shavit, N.: Understanding tradeoffs in software transactional memory. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO ’07, pp. 21–33. IEEE Computer Society, Washington, DC (2007). https://doi.org/10.1109/CGO.2007.38
- 16.Diegues, N., Romano, P.: Self-tuning intel transactional synchronization extensions. In: 11th International Conference on Autonomic Computing (ICAC 14), pp. 209–219. USENIX Association, Philadelphia (2014). https://www.usenix.org/conference/icac14/technical-sessions/presentation/diegues
- 17.Diegues, N., Romano, P., Garbatov, S.: Seer: probabilistic scheduling for hardware transactional memory. In: Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’15, pp. 224–233. ACM, New York (2015). https://doi.org/10.1145/2755573.2755578. http://doi.acm.org.focus.lib.kth.se/10.1145/2755573.2755578
- 18.Diegues, N., Romano, P., Rodrigues, L.: Virtues and limitations of commodity hardware transactional memory. In: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT ’14, pp. 3–14. ACM, New York (2014). https://doi.org/10.1145/2628071.2628080
- 19.Evan Jones: tpccbench. https://github.com/evanj/tpccbench (2007). Accessed 23 Sept 2019
- 20.Felber, P., Issa, S., Matveev, A., Romano, P.: Hardware read-write lock elision. In: Proceedings of the Eleventh European Conference on Computer Systems, EuroSys ’16, pp. 34:1–34:15. ACM, New York (2016). https://doi.org/10.1145/2901318.2901346
- 21.Garivier, A., Moulines, E.: On upper-confidence bound policies for switching bandit problems. In: Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT’11, pp. 174–188. Springer, Berlin (2011). http://dl.acm.org/citation.cfm?id=2050345.2050365 Google Scholar
- 22.Goel, B., Titos-Gil, R., Negi, A., McKee, S.A., Stenstrom, P.: Performance and energy analysis of the restricted transactional memory implementation on haswell. In: 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 615–624 (2014). https://doi.org/10.1109/IPDPS.2014.70
- 23.Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, pp. 175–184. ACM, New York (2008). https://doi.org/10.1145/1345206.1345233
- 24.Harris, T.L.: A pragmatic implementation of non-blocking linked-lists. In: Proceedings of the 15th International Conference on Distributed Computing, DISC ’01, pp. 300–314. Springer, London (2001). http://dl.acm.org/citation.cfm?id=645958.676105 Google Scholar
- 25.Hart, T.E., McKenney, P.E., Brown, A.D., Walpole, J.: Performance of memory reclamation for lockless synchronization. J. Parallel Distrib. Comput. 67(12), 1270–1285 (2007). https://doi.org/10.1016/j.jpdc.2007.04.010 CrossRefzbMATHGoogle Scholar
- 26.Issa, S.: P8TM source code and benchmarks. https://github.com/shadyalaa/POWER8TM (2017). Accessed 23 Sept 2019
- 27.Jacobi, C., Slegel, T., Greiner, D.: Transactional memory architecture and implementation for IBM System Z. In: Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, pp. 25–36. IEEE Computer Society, Washington, DC (2012). https://doi.org/10.1109/MICRO.2012.12
- 28.Java docs: ReentrantReadWriteLock. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html (2018). Accessed 23 Sept 2019
- 29.Kumar, S., Chu, M., Hughes, C.J., Kundu, P., Nguyen, A.: Hybrid transactional memory. In: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pp. 209–220. ACM, New York (2006). https://doi.org/10.1145/1122971.1123003
- 30.Lai, T., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6, 4–22 (1985)MathSciNetCrossRefGoogle Scholar
- 31.Le, H.Q., Guthrie, G.L., Williams, D.E., Michael, M.M., Frey, B.G., Starke, W.J., May, C., Odaira, R., Nakaike, T.: Transactional memory support in the IBM POWER8 processor. IBM J. Res. Dev. 59(1), 8:1–8:14 (2015). https://doi.org/10.1147/JRD.2014.2380199 CrossRefGoogle Scholar
- 32.Marathe, V.J., Scherer, W.N., Scott, M.L.: Adaptive software transactional memory. In: Proceedings of the 19th International Conference on Distributed Computing, DISC’05, pp. 354–368. Springer, Berlin (2005). https://doi.org/10.1007/11561927_26 Google Scholar
- 33.Matveev, A., Shavit, N.: Reduced Hardware NOrec: a safe and scalable hybrid transactional memory. In: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, pp. 59–71. ACM, New York (2015). https://doi.org/10.1145/2694344.2694393
- 34.Mckenney, P.E., Slingwine, J.D.: Read-copy update: using execution history to solve concurrency problems. In: Parallel and Distributed Computing and Systems, pp. 509–518. Las Vegas, NV (1998)Google Scholar
- 35.McKenney, P.E., Walpole, J.: What is RCU, fundamentally? https://lwn.net/Articles/262464/ (2007). Accessed 23 Sept 2019
- 36.Minh, C.C., Chung, J., Kozyrakis, C., Olukotun, K.: Stamp: Stanford transactional applications for multi-processing. In: IEEE International Symposium on Workload Characterization, 2008. IISWC 2008, pp. 35–46. IEEE (2008)Google Scholar
- 37.Nakaike, T., Odaira, R., Gaudet, M., Michael, M.M., Tomari, H.: Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8. In: Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, pp. 144–157. ACM, New York (2015). https://doi.org/10.1145/2749469.2750403
- 38.Nussbaum, D., Lev, Y., Moir, M.: PhTM: Phased transactional memory. In: The Second ACM SIGPLAN Workshop on Transactional Computing, TRANSACT ’07. ACM, New York (2007). https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=4058
- 39.Ortner, R., Ryabko, D., Auer, P., Munos, R.: Regret bounds for restless markov bandits. In: Proceedings of the 23rd International Conference on Algorithmic Learning Theory, ALT’12, pp. 214–228. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-34106-9_19 Google Scholar
- 40.Raminhas, P., Issa, T., Romano, P.: Enhancing efficiency of hybrid transactional memory via dynamic data partitioning schemes. In: Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid ’18, pp. 1–11 (2018). https://doi.org/10.1109/CCGRID.2018.94
- 41.Rossbach, C.J., Hofmann, O.S., Witchel, E.: Is transactional programming actually easier? In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’10, pp. 47–56. ACM, New York (2010). https://doi.org/10.1145/1693453.1693462
- 42.Stonebraker, M., Madden, S., Abadi, D.J., Harizopoulos, S., Hachem, N., Helland, P.: The end of an architectural era: (it’s time for a complete rewrite). In: Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pp. 1150–1160. VLDB Endowment (2007). http://dl.acm.org/citation.cfm?id=1325851.1325981
- 43.TPC Council: TPC-C Benchmark. http://www.tpc.org/tpcc (2011). Accessed 23 Sept 2019
- 44.Wang, A., Gaudet, M., Wu, P., Amaral, J.N., Ohmacht, M., Barton, C., Silvera, R., Michael, M.: Evaluation of Blue Gene/Q hardware support for transactional memories. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT ’12, pp. 127–136. ACM, New York (2012). https://doi.org/10.1145/2370816.2370836
- 45.Wang, Q., Kulkarni, S., Cavazos, J., Spear, M.: A transactional memory with automatic performance tuning. ACM Trans. Archit. Code Optim. 8(4), 54:1–54:23 (2012). https://doi.org/10.1145/2086696.2086733 CrossRefGoogle Scholar
- 46.Yoo, R.M., Hughes, C.J., Lai, K., Rajwar, R.: Performance evaluation of Intel® transactional synchronization extensions for high-performance computing. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pp. 19:1–19:11. ACM, New York (2013). https://doi.org/10.1145/2503210.2503232
- 47.Yu, J.Y., Mannor, S.: Piecewise-stationary bandit problems with side observations. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp. 1177–1184. ACM, New York (2009). https://doi.org/10.1145/1553374.1553524