Implementation and Optimization of the OpenMP Accelerator Model for the TI Keystone II Architecture

Mitra, Gaurav; Stotzer, Eric; Jayaraj, Ajay; Rendell, Alistair P.

doi:10.1007/978-3-319-11454-5_15

Gaurav Mitra²⁰,²¹, Eric Stotzer²⁰, Ajay Jayaraj²⁰ & Alistair P. Rendell²¹

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 8766)

Included in the following conference series:

International Workshop on OpenMP


Abstract

The TI Keystone II architecture provides a unique combination of ARM Cortex-A15 processors with high-performance TI C66x floating-point DSPs on a single low-power System-on-Chip (SoC). Commercially available systems such as the HP ProLiant m800 and nCore BrownDwarf are based on this ARM-DSP SoC. The Keystone II architecture promises to deliver high GFLOPS/Watt and is of increasing interest as it provides an alternate building block for future exascale systems. However, the success of this architecture is intimately related to the ease of migrating existing HPC applications for maximum performance. Effective use of all ARM and DSP cores and DMA co-processors is crucial for maximizing performance/Watt. This paper explores issues and challenges encountered while migrating the matrix multiplication (GEMM) kernel, originally written only for the C6678 DSP, to the ARM-DSP SoC using an early prototype of the OpenMP 4.0 accelerator model. Single-precision (SGEMM) matrix multiplication performance of 110.11 GFLOPS and double-precision (DGEMM) performance of 29.15 GFLOPS were achieved on the TI Keystone II Evaluation Module Revision 3.0 (EVM). Trade-offs and factors affecting performance are discussed.
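For readers unfamiliar with the OpenMP 4.0 accelerator model, the following is a minimal sketch of how a GEMM kernel might be offloaded with the standard `target` and `map` constructs. It is an illustration only and is not the authors' optimized implementation: the function names `sgemm_offload` and `gemm_gflops`, the row-major layout, and the naive triple loop are all assumptions made here for clarity. The helper assumes the conventional 2·m·n·k floating-point operation count for GEMM when converting a run time into GFLOPS.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration: C = alpha*A*B + beta*C with row-major storage.
 * A is m x k, B is k x n, C is m x n.
 * The map clauses copy A and B to the device and copy C both ways.
 * Without an OpenMP 4.0 compiler the pragmas are ignored and the
 * loop nest simply runs on the host.
 */
void sgemm_offload(int m, int n, int k, float alpha,
                   const float *a, const float *b, float beta, float *c)
{
    assert(m > 0 && n > 0 && k > 0);

    #pragma omp target map(to: a[0:m*k], b[0:k*n]) map(tofrom: c[0:m*n])
    #pragma omp parallel for
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += a[(size_t)i * k + p] * b[(size_t)p * n + j];
            c[(size_t)i * n + j] = alpha * acc + beta * c[(size_t)i * n + j];
        }
    }
}

/* Convert a measured run time into GFLOPS, assuming 2*m*n*k operations. */
double gemm_gflops(int m, int n, int k, double seconds)
{
    return 2.0 * m * n * k / seconds / 1e9;
}
```

A caller would pass host arrays to `sgemm_offload` and time the call to feed `gemm_gflops`. A tuned kernel of the kind the abstract describes would additionally block the matrices and stage data through on-chip memory with DMA, which this sketch deliberately omits.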



Author information

Authors and Affiliations

Texas Instruments, Dallas, TX, USA
Gaurav Mitra, Eric Stotzer & Ajay Jayaraj
Australian National University, Canberra, ACT, Australia
Gaurav Mitra & Alistair P. Rendell


Editor information

Editors and Affiliations

Cray Inc., Cray Plaza, 380 Jackson St., Suite 210, 55101, St. Paul, MN, USA
Luiz DeRose
Lawrence Livermore National Laboratory, 94551-0808, Livermore, CA, USA
Bronis R. de Supinski
Sandia National Laboratories, Albuquerque, NM, USA
Stephen L. Olivier
University of Houston, Houston, TX, USA
Barbara M. Chapman
RWTH Aachen, Aachen, Germany
Matthias S. Müller


About this paper

Cite this paper

Mitra, G., Stotzer, E., Jayaraj, A., Rendell, A.P. (2014). Implementation and Optimization of the OpenMP Accelerator Model for the TI Keystone II Architecture. In: DeRose, L., de Supinski, B.R., Olivier, S.L., Chapman, B.M., Müller, M.S. (eds) Using and Improving OpenMP for Devices, Tasks, and More. IWOMP 2014. Lecture Notes in Computer Science, vol 8766. Springer, Cham. https://doi.org/10.1007/978-3-319-11454-5_15


DOI: https://doi.org/10.1007/978-3-319-11454-5_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11453-8
Online ISBN: 978-3-319-11454-5
eBook Packages: Computer Science, Computer Science (R0)
