TiDA: High-Level Programming Abstractions for Data Locality Management

Unat, Didem; Nguyen, Tan; Zhang, Weiqun; Farooqi, Muhammed Nufail; Bastem, Burak; Michelogiannakis, George; Almgren, Ann; Shalf, John

doi:10.1007/978-3-319-41321-1_7

Didem Unat¹⁶,
Tan Nguyen¹⁷,
Weiqun Zhang¹⁷,
Muhammed Nufail Farooqi¹⁶,
Burak Bastem¹⁶,
George Michelogiannakis¹⁷,
Ann Almgren¹⁷ &
…
John Shalf¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9697))

Included in the following conference series:

International Conference on High Performance Computing

2836 Accesses
10 Citations

Abstract

The high energy costs for data movement compared to computation gives paramount importance to data locality management in programs. Managing data locality manually is not a trivial task and also complicates programming. Tiling is a well-known approach that provides both data locality and parallelism in an application. However, there is no standard programming construct to express tiling at the application level. We have developed a multicore programming model, TiDA, based on tiling and implemented the model as C++ and Fortran libraries. The proposed programming model has three high level abstractions, tiles, regions and tile iterator. These abstractions in the library hide the details of data decomposition, cache locality optimizations, and memory affinity management in the application. In this paper we unveil the internals of the library and demonstrate the performance and programability advantages of the model on five applications on multiple NUMA nodes. The library achieves up to 2.10x speedup over OpenMP in a single compute node for simple kernels, and up to 22x improvement over a single thread for a more complex combustion proxy application (SMC) on 24 cores. The MPI+TiDA implementation of geometric multigrid demonstrates a 30.9 % performance improvement over MPI+OpenMP when scaling to 3072 cores (excluding MPI communication overheads, 8.5 % otherwise).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

PLuTo, A polyhedral automatic parallelizer and locality optimizer for multicores. Software. http://pluto-compiler.sourceforge.net
Real World Technologies: Knights Landing Details. http://www.realworldtech.com/knights-landing-details/
Balfour, J., Dally, W.J.: Design tradeoffs for tiled CMP on-chip networks. In: Proceedings of the 20th Annual International Conference on Supercomputing, ICS 2006 (2006)
Google Scholar
Bertozzi, S., Acquaviva, A., Bertozzi, D., Poggiali, A.: Supporting task migration in multi-processor systems-on-chip: a feasibility study. In: Proceedings of Design, Automation and Test in Europe, 2006, DATE 2006, vol. 1, pp. 1–6, March 2006
Google Scholar
Bianco, M., Cumming, B.: A generic strategy for multi-stage stencils. In: Silva, F., Dutra, I., Santos Costa, V. (eds.) Euro-Par 2014 Parallel Processing. LNCS, vol. 8632, pp. 584–595. Springer, Heidelberg (2014)
Google Scholar
Bikshandi, G., Guo, J., Hoeflinger, D., Almasi, G., Fraguela, B.B., Garzarán, M.J., Padua, D., von Praun, C.: Programming for parallelism and locality with hierarchically tiled arrays. In: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPopp, 2006, pp. 48–57. ACM, New York (2006)
Google Scholar
Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not. 3(6), 101–113 (2008)
Article Google Scholar
Chen, J.H., Choudhary, A., de Supinski, B., DeVries, M., Hawkes, E.R., Klasky, S., Liao, W.K., Ma, K.L., Mellor-Crummey, J., Podhorszki, N., Sankaran, R., Shende, S., Yoo, C.S.: Terascale direct numerical simulations of turbulent combustion using S3D. Comput. Sci. Discovery 2(1), 015001 (2009)
Article Google Scholar
Chen, R., Chen, H.: Tiled-mapreduce: efficient and flexible mapreduce processing on multicore with tiling. ACM Trans. Archit. Code Optim. 10(1), 3:1–3:30 (2013)
Google Scholar
Das, R., Mutlu, O., Moscibroda, T., Das, C.R.: Application-aware prioritization mechanisms for on-chip networks. In: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO, pp. 280–291 (2009)
Google Scholar
Datta, K., Murphy, M., Volkov, V., Williams, S., Carter, J., Oliker, L., Patterson, D., Shalf, J., Yelick, K.: Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: Proceedings of the ACM/IEEE Conference on Supercomputing, SC 2008, pp. 4:1–4:12. IEEE Press, Piscataway (2008)
Google Scholar
Edwards, H.C., Sunderland, D., Porter, V., Amsler, C., Mish, S.: Manycore performance-portability: Kokkos multidimensional array library. Sci. Program. 20(2), 89–114 (2012)
Google Scholar
Emmett, M., Zhang, W., Bell, J.B.: High-order algorithms for compressible reacting flow with complex chemistry. Combust. Theor. Model. 18(3), 361–387 (2014)
Article MathSciNet Google Scholar
Fuchs, T., Fürlinger, K.: Expressing and exploiting multidimensional locality in DASH. In: Proceedings of the SPPEXA Symposium 2016. Lecture Notes in Computational Science and Engineering, Garching, Germany, January 2016
Google Scholar
Goglin, B.: Managing the topology of heterogeneous cluster nodes with hardware locality (hwloc). In: International Conference on High Performance Computing and Simulation, HPCS 2014, Bologna, Italy, 21–25 July 2014, pp. 74–81 (2014)
Google Scholar
Hall, M., Chame, J., Chen, C., Shin, J., Rudy, G., Khan, M.M.: Loop transformation recipes for code generation and auto-tuning. In: Gao, G.R., Pollock, L.L., Cavazos, J., Li, X. (eds.) LCPC 2009. LNCS, vol. 5898, pp. 50–64. Springer, Heidelberg (2010)
Chapter Google Scholar
Hartono, A., Baskaran, M.M., Bastoul, C., Cohen, A., Krishnamoorthy, S., Norris, B., Ramanujam, J., Sadayappan, P.: Parametric multi-level tiling of imperfectly nested loops. In: Proceedings of the 23rd International Conference on Supercomputing, ICS 2009, pp. 147–157. ACM, New York (2009)
Google Scholar
Jingcao, H., Marculescu, R.: Energy-aware mapping for tile-based NoC architectures under performance constraints. In: Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC 2003, pp. 233–239 (2003)
Google Scholar
Kim, D., Rajopadhye, S.: Parameterized tiling for imperfectly nested loops. Technical report CS-09-101, Department of Computer Science, Colorado State University (2009)
Google Scholar
Kim, D., Renganarayanan, L., Rostron, D., Rajopadhye, S., Strout, M.M.: Multi-level tiling: M for the price of one. In: Proceedings of the ACM/IEEE Conference on Supercomputing, SC 2007, pp. 51:1–51:12. ACM, New York (2007)
Google Scholar
Murali, S., De Micheli, G.: Bandwidth-constrained mapping of cores onto NoC architectures. In: Proceedings of the Conference on Design, Automation and Test in Europe - vol. 2, DATE ’04, (2004)
Google Scholar
Renganarayanan, L., Kim, D.G., Rajopadhye, S., Strout, M.M.: Parameterized tiled loops for free. SIGPLAN Not. 42(6), 405–414 (2007)
Article Google Scholar
Rogers, B.M., Krishna, A., Bell, G.B., Ken, V., Jiang, X., Solihin, Y.: Scaling the bandwidth wall: challenges in and avenues for CMP scaling. In: Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA, pp. 371–382 (2009)
Google Scholar
Shalf, J., Dosanjh, S., Morrison, J.: Exascale computing technology challenges. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. LNCS, vol. 6449, pp. 1–25. Springer, Heidelberg (2011)
Chapter Google Scholar
Unat, D., Cai, X., Baden, S.B.: Mint: realizing CUDA performance in 3D stencil methods with annotated C. In: Proceedings of the International Conference on Supercomputing, ICS 2011, pp. 214–224. ACM, New York (2011)
Google Scholar
Unat, D., Chan, C., Zhang, W., Bell, J., Shalf, J.: Tiling as a durable abstraction for parallelism and data locality. In: Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, 18 November 2013
Google Scholar
Unat, D., Chan, C., Zhang, W., Williams, S., Bachan, J., Bell, J., Shalf, J.: Exasat: an exascale co-design tool for performance modeling. Int. J. High Perform. Comput. Appl. 29(2), 209–232 (2015)
Article Google Scholar
Unat, D., Shalf, J., Hoefler, T., Schulthess, T., Dubey, A., (eds.) et al.: Programming abstractions for data locality. Technical report (2014)
Google Scholar
Vega, A., Cabarcas, F., Ramirez, A., Valero, M.: Breaking the bandwidth wall in chip multiprocessors. In: International Conference on Embedded Computer Systems, SAMOS, pp. 255–262 (2011)
Google Scholar
Zhang, W., Almgren, A., Day, M., Nguyen, T., Shalf, J., Unat, D.: BoxLib with tiling: an AMR software framework. SIAM J. Sci. Comput. (2016)
Google Scholar
Zhou, W., Zhang, Y., Mao, Z.: An application specific NoC mapping for optimized delay. In: Design and Test of Integrated Systems in Nanoscale Technology, DTIS 2006, 184–188, September 2006
Google Scholar

Download references

Acknowledgments

Dr. Unat is supported by the Marie Sklodowska Curie Reintegration Grant 655965 by the European Commission. Authors from KU are supported by the Turkish Science and Technology Research Centre Grant No: 215E285. Authors from LBNL were supported by the SciDAC Program and the Exascale Co-Design Program under the U.S. DOE contract DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC02-05CH11231. We would like to acknowledge and thank John Bell and Hakan Memisoglu for their input.

Author information

Authors and Affiliations

Koç University, Istanbul, Turkey
Didem Unat, Muhammed Nufail Farooqi & Burak Bastem
Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Tan Nguyen, Weiqun Zhang, George Michelogiannakis, Ann Almgren & John Shalf

Authors

Didem Unat
View author publications
You can also search for this author in PubMed Google Scholar
Tan Nguyen
View author publications
You can also search for this author in PubMed Google Scholar
Weiqun Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Muhammed Nufail Farooqi
View author publications
You can also search for this author in PubMed Google Scholar
Burak Bastem
View author publications
You can also search for this author in PubMed Google Scholar
George Michelogiannakis
View author publications
You can also search for this author in PubMed Google Scholar
Ann Almgren
View author publications
You can also search for this author in PubMed Google Scholar
John Shalf
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Didem Unat .

Editor information

Editors and Affiliations

Deutsches Klimarechenzentrum, Hamburg, Germany
Julian M. Kunkel
Argonne National Laboratory, Lemont, Illinois, USA
Pavan Balaji
University of Tennessee, Knoxville, Tennessee, USA
Jack Dongarra

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Unat, D. et al. (2016). TiDA: High-Level Programming Abstractions for Data Locality Management. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science(), vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_7

Download citation

DOI: https://doi.org/10.1007/978-3-319-41321-1_7
Published: 15 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41320-4
Online ISBN: 978-3-319-41321-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics