Performance Optimization and Evaluation of Scalable Optoelectronics Application on Large Scale KNL Cluster

Hirokawa, Yuta; Boku, Taisuke; Uemoto, Mitsuharu; Sato, Shunsuke A.; Yabana, Kazuhiro

doi:10.1007/978-3-319-92040-5_11

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10876)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

“ARTED” is an advanced scientific code for electron dynamics simulation that has been ported to various large-scale parallel systems, including the “K” Computer, formerly the fastest supercomputer in the world, and many other MPP and cluster systems.

In this paper, we describe the code optimization and performance evaluation of ARTED on a large-scale cluster with Intel’s latest many-core processor, KNL (Knights Landing), building on past research on porting ARTED to the KNC (Knights Corner) coprocessor. We thoroughly optimized the dominant computation for KNL to achieve the highest performance, with detailed tuning of memory access, vectorization for the AVX-512 instruction set, cache utilization, and so on. For further tuning, we investigated various KNL-specific techniques such as combining MCDRAM and DDR4 memories and parallel vector summation.
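The abstract names parallel vector summation as one tuning technique without giving details. As a rough illustration only, assuming it means reducing several per-thread partial vectors into a single result, a generic pairwise (tree) reduction might look like the sketch below. The function `tree_sum` and its data layout are hypothetical and are not taken from ARTED.

```python
def tree_sum(vectors):
    """Sum equal-length vectors by pairwise (tree) reduction.

    Hypothetical helper for illustration; not ARTED's implementation.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    # Copy the inputs so the caller's data is left untouched.
    partial = [list(v) for v in vectors]
    n = len(partial)
    stride = 1
    while stride < n:
        # Within one step, every pair (i, i + stride) is independent,
        # so a parallel version could assign one pair per thread.
        for i in range(0, n - stride, 2 * stride):
            left, right = partial[i], partial[i + stride]
            for k in range(len(left)):
                left[k] += right[k]
        stride *= 2
    return partial[0]


if __name__ == "__main__":
    print(tree_sum([[1, 2], [3, 4], [5, 6]]))
```

With this layout the number of sequential steps grows with the logarithm of the number of vectors rather than linearly, which is the usual motivation for a tree-shaped reduction.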

After detailed performance tuning on each core, which achieved up to 25% of theoretical peak in the kernel part with 3-D stencil computation, we evaluated the application performance on the full system (25 PFLOPS theoretical peak) of the KNL cluster “Oakforest-PACS”, which is the largest KNL-based cluster in the world and uses the Intel Omni-Path Architecture. The application shows excellent weak scaling, with the dominant Hamiltonian computation reaching up to 4 PFLOPS (16% efficiency of the system) in double precision irrespective of simulation size, as well as reasonable strong scaling on material simulations requiring a high degree of parallelism.
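The quoted system efficiency follows directly from the two figures in the abstract. The short sketch below reproduces that arithmetic; the function name `efficiency` is only illustrative, and both inputs are the rounded values stated above.

```python
def efficiency(achieved_pflops, peak_pflops):
    """Fraction of theoretical peak that was actually achieved."""
    return achieved_pflops / peak_pflops


if __name__ == "__main__":
    system_peak = 25.0   # PFLOPS, full-system theoretical peak (from the abstract)
    hamiltonian = 4.0    # PFLOPS, "up to" figure for the Hamiltonian part
    print(f"{efficiency(hamiltonian, system_peak):.0%}")
```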

Notes

1. There are several larger KNL-based MPPs such as the Cray XC40 series; however, Oakforest-PACS is still the largest cluster.
2. Currently, an open-source optoelectronics application package named “SALMON” [26] is partly based on ARTED.
3. This is not exactly correct, as MCDRAM is used for the direct-map cache in cache-mode, where line conflicts between different arrays may occur.

References

1. Sato, S.A., Yabana, K.: Maxwell + TDDFT multi-scale simulation for laser-matter interactions. J. Adv. Simulat. Sci. Eng. 1(1), 98–110 (2014)
2. Yabana, K., Sugiyama, T., Shinohara, Y., et al.: Time-dependent density functional theory for strong electromagnetic fields in crystalline solids. Phys. Rev. B 85(4), 11 (2012). https://doi.org/10.1103/PhysRevB.85.045134
3. Andrade, X., et al.: Time-dependent density-functional theory in massively parallel computer architectures: the OCTOPUS project. J. Phys. Condens. Matter 24, 233202 (2012)
4. Noda, M., Ishimura, K., Nobusada, K., et al.: Massively-parallel electron dynamics calculations in real-time and real-space: toward applications to nanostructures of more than ten-nanometers in size. J. Comput. Phys. 265(14), 145–155 (2014)
5. Draeger, E.W., Andrade, X., Gunnels, J.A., et al.: Massively parallel first-principles simulation of electron dynamics in materials. In: 2016 IEEE International Parallel and Distributed Processing Symposium, p. 832 (2016)
6. Barnes, T., Cook, B., Deslippe, J., et al.: Evaluating and optimizing the NERSC workload on Knights Landing. In: Proceedings of the 7th International Workshop on PMBS 2016, pp. 43–53 (2016)
7. Rosales, C., Cazes, J., Milfeld, K., Gómez-Iglesias, A., Koesterke, L., Huang, L., Vienne, J.: A comparative study of application performance and scalability on the Intel Knights Landing processor. In: Taufer, M., Mohr, B., Kunkel, J.M. (eds.) ISC High Performance 2016. LNCS, vol. 9945, pp. 307–318. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46079-6_22
8. Joó, B., Kalamkar, D.D., Kurth, T., Vaidyanathan, K., Walden, A.: Optimizing Wilson-Dirac operator and linear solvers for Intel® KNL. In: Taufer, M., Mohr, B., Kunkel, J.M. (eds.) ISC High Performance 2016. LNCS, vol. 9945, pp. 415–427. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46079-6_30
9. Yount, C., Duran, A.: Effective use of large high-bandwidth memory caches in HPC stencil computation via temporal wave-front tiling. In: Proceedings of the 7th International Workshop on PMBS 2016, pp. 65–75 (2016)
10. Hofmann, J., Treibig, J., Hager, G., Wellein, G.: Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips. In: Proceedings of WPMVP 2014, pp. 55–64 (2014)
11. Andreolli, C.: Eight Optimizations for 3-Dimensional Finite Difference (3DFD) Code with an Isotropic (ISO). https://software.intel.com/en-us/articles/eight-optimizations-for-3-dimensional-finite-difference-3dfd-code-with-an-isotropic-iso
12. Blelloch, G.E.: Prefix Sums and Their Applications, School of Computer Science, Carnegie Mellon University, CMU-CS-90-190, November 1990
13. Martin, P.J., Ayuso, L.F., Torres, R., Gavilanes, A.: Algorithmic strategies for optimizing the parallel reduction primitive in CUDA. In: 2012 International Conference on High Performance Computing and Simulation, pp. 511–519, July 2012
14. Sodani, A.: Knights Landing (KNL): 2nd generation Intel Xeon Phi processor. IEEE Hot Chips 27, 1–24 (2015)
15. Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009)
16. Hirokawa, Y., Boku, T., Sato, S.A., Yabana, K.: Electron dynamics simulation with time-dependent density functional theory on large scale symmetric mode Xeon Phi cluster. In: The 17th IEEE International Workshop on PDSEC 2016 (2016)
17. Schultze, M., Ramasesha, K., Pemmaraju, C., et al.: Attosecond band-gap dynamics in Silicon. Science 346(6215), 1348–1352 (2014)
18. Lucchini, M., Sato, S.A., Ludwig, A., et al.: Attosecond dynamical Franz-Keldysh effect in polycrystalline diamond. Science 353(6302), 916–919 (2016)
19. Malinauskas, M., Zukauskas, A., Hasegawa, S., et al.: Ultrafast laser processing of materials: from science to industry. Light Sci. Appl. 5, e16133 (2016)
20. RIKEN AICS. http://www.aics.riken.jp/en/
21. CCS, University of Tsukuba. http://www.ccs.tsukuba.ac.jp/eng/
22. Joint Center for Advanced HPC. http://jcahpc.jp/eng/
23. TOP500. http://www.top500.org/
24. OCTOPUS. http://octopus-code.org
25. Github: ARTED. https://github.com/ARTED/ARTED
26. SALMON. http://salmon-tddft.jp/

Acknowledgment

Part of this research was based on the Oakforest-PACS system operated at the JCAHPC in cooperation with the Information Technology Center at the University of Tokyo and the Center for Computational Sciences at the University of Tsukuba. The performance evaluation in this paper using the COMA system was supported in part by the interdisciplinary collaborative research program at the Center for Computational Sciences, University of Tsukuba. This work was also supported by CREST, JST (Grant No. JPMJCR16N5).

Author information

Authors and Affiliations

Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan
Yuta Hirokawa & Taisuke Boku
Center for Computational Sciences, University of Tsukuba, Tsukuba, Japan
Taisuke Boku, Mitsuharu Uemoto & Kazuhiro Yabana
Max Planck Institute for the Structure and Dynamics of Matter, Hamburg, Germany
Shunsuke A. Sato

Corresponding author

Correspondence to Yuta Hirokawa.

Editor information

Editors and Affiliations

Tokyo Institute of Technology, Tokyo, Japan
Rio Yokota
University of Edinburgh, Edinburgh, United Kingdom
Michèle Weiland
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
David Keyes
Technische Universität München, Garching bei München, Germany
Carsten Trinitis

About this paper

Cite this paper

Hirokawa, Y., Boku, T., Uemoto, M., Sato, S.A., Yabana, K. (2018). Performance Optimization and Evaluation of Scalable Optoelectronics Application on Large Scale KNL Cluster. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, vol 10876. Springer, Cham. https://doi.org/10.1007/978-3-319-92040-5_11

DOI: https://doi.org/10.1007/978-3-319-92040-5_11
Published: 29 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92039-9
Online ISBN: 978-3-319-92040-5
eBook Packages: Computer Science, Computer Science (R0)
