
An Analytical Model-Based Auto-tuning Framework for Locality-Aware Loop Scheduling

  • Conference paper
High Performance Computing (ISC High Performance 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9697)

Abstract

HPC developers aim to deliver the very best performance, constantly weighing memory bandwidth, the memory hierarchy, locality, floating-point performance, power/energy constraints, and so on. Application scientists, on the other hand, aim to write performance-portable code while exploiting the rich feature set of the hardware. By providing adequate hints to the compiler in the form of directives, appropriate executable code can be generated, and directive-based programming offers tremendous benefits. However, applications are becoming more and more complex, and sophisticated tools such as auto-tuning are needed to explore the optimization space effectively. Loops typically form a major, time-consuming portion of application code. Scheduling these loops involves mapping the loop iteration space onto the underlying platform, for example GPU threads. Typically the user tries different scheduling techniques until the best one is identified, but this process can be quite tedious and time consuming, especially for a relatively large application, as the user needs to record the performance of every schedule's run. This paper offers a better solution: an auto-tuning framework that adopts an analytical model to guide the compiler and the runtime in automatically choosing an appropriate schedule for each loop and determining the launch configuration for each loop schedule. Our experiments show that the loop schedule predicted by our framework achieves an average speedup of 1.29x over the default loop schedule chosen by the compiler.
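The core idea in the abstract, using an analytical cost model to rank candidate loop schedules so the compiler and runtime can pick one without exhaustively benchmarking each, can be sketched in a few lines of C. This is an illustrative toy, not the paper's actual model: the `Schedule` struct, the `pick_schedule` function, and the cost formula (sequential work per lane plus a linear locality penalty on vector width) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* A candidate loop schedule: how the iteration space is mapped onto
   the GPU. Names follow OpenACC terminology; the struct itself is a
   simplification invented for this sketch. */
typedef struct {
    int gang;    /* number of gangs (thread blocks) */
    int vector;  /* vector length (threads per block) */
} Schedule;

/* Toy analytical cost: iterations divided among parallel lanes, plus
   a penalty proportional to vector width standing in for lost data
   locality. A real model would account for cache reuse distance,
   memory coalescing, occupancy, etc. */
static double predict_cost(long iterations, double reuse_penalty,
                           Schedule s) {
    return (double)iterations / (s.gang * s.vector)
         + reuse_penalty * s.vector;
}

/* Return the index of the candidate with the lowest predicted cost,
   i.e. the schedule the framework would hand to the runtime. */
size_t pick_schedule(long iterations, double reuse_penalty,
                     const Schedule *candidates, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (predict_cost(iterations, reuse_penalty, candidates[i])
          < predict_cost(iterations, reuse_penalty, candidates[best]))
            best = i;
    }
    return best;
}
```

The point of the sketch is the selection step: once a model can score a schedule analytically, choosing among candidates is a cheap minimization rather than a series of timed runs, which is what makes the approach automatic.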



Author information

Correspondence to Rengan Xu.


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Xu, R., Chandrasekaran, S., Tian, X., Chapman, B. (2016). An Analytical Model-Based Auto-tuning Framework for Locality-Aware Loop Scheduling. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol. 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_1


  • DOI: https://doi.org/10.1007/978-3-319-41321-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41320-4

  • Online ISBN: 978-3-319-41321-1

