Analysis of Data Reuse in Task-Parallel Runtimes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8551)

Abstract

This paper proposes a methodology for studying the data reuse quality of task-parallel runtimes. We introduce a coarse-grained version of the reuse distance method called Kernel Reuse Distance (KRD). The metric is a low-overhead alternative designed to analyze data reuse at the socket level while minimizing perturbation of the parallel schedule. Using the KRD metric, we show that reuse depends considerably on the system configuration (sockets, cores) and on the runtime scheduler. Furthermore, we correlate KRD with hardware metrics such as cache misses and work time inflation. Overall, we find that KRD can be used effectively to assess data reuse in parallel applications. The study also reveals that several current runtimes suffer from severe bottlenecks at scale, which often dominate performance.
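The KRD metric described in the abstract is a coarse-grained (per-kernel, per-socket) variant of the classical reuse distance of Mattson et al., where the distance of an access is the number of distinct addresses touched since the previous access to the same address. As a minimal illustration of the underlying idea (not the paper's KRD implementation, which operates at task granularity), a per-access stack-distance computation can be sketched as follows; the function name and trace are illustrative only:

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Reuse (LRU stack) distance of each access in an address trace.

    The distance of an access is the number of *distinct* addresses
    touched since the previous access to the same address; a first
    access has infinite distance, represented here as None.
    """
    stack = OrderedDict()  # insertion order models the LRU stack
    distances = []
    for addr in trace:
        if addr in stack:
            # distinct addresses accessed since the last use of addr
            keys = list(stack.keys())
            distances.append(len(keys) - 1 - keys.index(addr))
            del stack[addr]
        else:
            distances.append(None)  # cold miss: infinite distance
        stack[addr] = True  # (re)insert at the top of the stack
    return distances

# trace: a b c a b b -> distances: inf inf inf 2 2 0
print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
```

A histogram of these distances directly yields miss ratios for any fully associative LRU cache size, which is what makes the metric attractive for correlating with measured cache misses; KRD trades per-access precision for low overhead by applying the same idea at the granularity of whole kernels.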



Acknowledgments

This work has been supported by a JSPS postdoctoral fellowship (P-12044). We would like to thank the anonymous reviewers for their valuable feedback.

Author information

Corresponding author: Miquel Pericàs.


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Pericàs, M., Amer, A., Taura, K., Matsuoka, S. (2014). Analysis of Data Reuse in Task-Parallel Runtimes. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation. PMBS 2013. Lecture Notes in Computer Science, vol 8551. Springer, Cham. https://doi.org/10.1007/978-3-319-10214-6_4

  • DOI: https://doi.org/10.1007/978-3-319-10214-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10213-9

  • Online ISBN: 978-3-319-10214-6

  • eBook Packages: Computer Science (R0)
