Balancing Shared and Distributed Heaps on NUMA Architectures

  • Conference paper
Trends in Functional Programming (TFP 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8843)

Abstract

Due to the varying latencies between memory banks, efficient shared memory access is challenging on modern NUMA architectures. This has a major impact on the shared memory performance of parallel programs, particularly those written in languages with automatic memory management.

This paper presents a performance evaluation of distributed and shared heap implementations of parallel Haskell on a state-of-the-art physical shared memory NUMA machine. The evaluation exposes bottlenecks in shared-memory management that limit scalability beyond 25 of the machine's 48 cores.

We demonstrate that a hybrid system, GUMSMP, which combines distributed and shared heap abstractions, consistently outperforms the shared memory GHC implementation on seven benchmarks, by a factor of 3.3 on average. Specifically, we show that the best results are obtained when memory is shared only within a single NUMA region and distributed memory abstractions are used across regions.
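The layout described above, one shared heap per NUMA region with distributed-memory communication between regions, can be illustrated with a minimal sketch. It is not GUMSMP code. The names `ProcessingElement`, `partition_cores` and `same_region` are hypothetical, the region size of 6 cores is an assumption (the abstract gives only the total of 48 cores), and contiguous core numbering within a region is also assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessingElement:
    """Hypothetical model of one shared-heap instance confined to a NUMA region."""
    region: int
    cores: List[int]


def partition_cores(total_cores: int, cores_per_region: int) -> List[ProcessingElement]:
    """Group cores into one processing element per NUMA region.

    Assumes cores are numbered contiguously within each region; real
    machines may interleave core IDs, so treat this as illustrative only.
    """
    if cores_per_region <= 0 or total_cores % cores_per_region != 0:
        raise ValueError("total_cores must be a positive multiple of cores_per_region")
    return [
        ProcessingElement(
            region=r,
            cores=list(range(r * cores_per_region, (r + 1) * cores_per_region)),
        )
        for r in range(total_cores // cores_per_region)
    ]


def same_region(pes: List[ProcessingElement], core_a: int, core_b: int) -> bool:
    """True if both cores belong to the same processing element (shared heap).

    In the hybrid scheme, a False result would mean the two cores interact
    through distributed-memory abstractions rather than a shared heap.
    """
    return any(core_a in pe.cores and core_b in pe.cores for pe in pes)


# 48 cores is taken from the abstract; 6 cores per region is an assumption.
pes = partition_cores(total_cores=48, cores_per_region=6)
```

Under these assumptions, `same_region(pes, 0, 5)` holds because both cores fall in region 0, while `same_region(pes, 5, 6)` does not, so cores 5 and 6 would interact through the distributed layer.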

Author information

Corresponding author

Correspondence to Malak Aljabri.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Aljabri, M., Loidl, H.-W., Trinder, P. (2015). Balancing Shared and Distributed Heaps on NUMA Architectures. In: Hage, J., McCarthy, J. (eds) Trends in Functional Programming. TFP 2014. Lecture Notes in Computer Science, vol 8843. Springer, Cham. https://doi.org/10.1007/978-3-319-14675-1_1

  • DOI: https://doi.org/10.1007/978-3-319-14675-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-14674-4

  • Online ISBN: 978-3-319-14675-1

  • eBook Packages: Computer Science, Computer Science (R0)