Abstract
“ARTED” is an advanced scientific code for electron dynamics simulation that has been ported to various large-scale parallel systems, including the “K” Computer, formerly the fastest supercomputer in the world, and many other MPP and cluster systems.
In this paper, we describe the code optimization and performance evaluation of ARTED on a large-scale cluster with Intel’s latest many-core processor, KNL (Knights Landing), building on past research on porting ARTED to the KNC (Knights Corner) coprocessor. The dominant computation was thoroughly optimized for KNL to achieve the highest performance, with detailed tuning of memory access, vectorization for the AVX-512 instruction set, cache utilization, etc. For further tuning, we investigated various KNL-specific techniques such as combining MCDRAM and DDR4 memories and parallel vector summation.
After detailed performance tuning on each core, which achieved up to 25% of theoretical peak in the kernel part with 3-D stencil computation, we evaluated the application performance on the full system (25 PFLOPS of theoretical peak) of the KNL cluster “Oakforest-PACS”, which is the largest KNL-based cluster in the world using the Intel Omni-Path Architecture. The application shows excellent weak scaling, with a dominant Hamiltonian performance of up to 4 PFLOPS (16% efficiency of the system) in double precision irrespective of simulation size, as well as reasonable strong scaling on material simulations requiring a high degree of parallelism.
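The system efficiency quoted above is the ratio of achieved to theoretical peak performance. A minimal sketch of that arithmetic in Python follows; the function name `efficiency` is an illustrative assumption and is not taken from ARTED.

```python
def efficiency(achieved_pflops: float, peak_pflops: float) -> float:
    """Return achieved performance as a fraction of theoretical peak."""
    if peak_pflops <= 0:
        raise ValueError("theoretical peak must be positive")
    return achieved_pflops / peak_pflops


# Figures quoted in the abstract: up to 4 PFLOPS on a 25 PFLOPS system.
system_efficiency = efficiency(4.0, 25.0)
print(f"{system_efficiency:.0%}")  # prints "16%"
```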
Notes
1. There are several larger KNL-based MPPs, such as the Cray XC40 series; however, Oakforest-PACS is still the largest cluster.
2. Currently, an open-source optoelectronics application package named “SALMON” [26] is partly based on ARTED.
3. This is not exactly correct, as MCDRAM is used as a direct-mapped cache in cache mode, where line conflicts between different arrays may occur.
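The line conflicts mentioned in note 3 can be illustrated with a small model of a direct-mapped cache, in which each memory line has exactly one slot it can occupy. This is only a sketch under stated assumptions: the 64-byte line size, the 16 GiB cache capacity, the base addresses, and the helper names `cache_slot` and `conflicts` are all illustrative and do not come from the paper.

```python
LINE_BYTES = 64                    # assumed cache-line size
CACHE_BYTES = 16 * 2**30           # assumed MCDRAM capacity used as cache (16 GiB)
NUM_LINES = CACHE_BYTES // LINE_BYTES


def cache_slot(phys_addr: int) -> int:
    """Slot of a direct-mapped cache that a physical address maps to."""
    return (phys_addr // LINE_BYTES) % NUM_LINES


def conflicts(addr_a: int, addr_b: int) -> bool:
    """True if two addresses on different memory lines compete for one slot."""
    different_line = addr_a // LINE_BYTES != addr_b // LINE_BYTES
    return different_line and cache_slot(addr_a) == cache_slot(addr_b)


# Hypothetical base addresses of two arrays placed exactly one cache size apart.
array_a = 0x1000
array_b = array_a + CACHE_BYTES

print(conflicts(array_a, array_b))          # True: same slot, different lines
print(conflicts(array_a, array_a + 4096))   # False: different slots
```

In this model, two arrays whose physical addresses differ by a multiple of the cache capacity evict each other's lines at every matching offset, even when their combined size is far below the cache capacity.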
References
Sato, S.A., Yabana, K.: Maxwell + TDDFT multi-scale simulation for laser-matter interactions. J. Adv. Simulat. Sci. Eng. 1(1), 98–110 (2014)
Yabana, K., Sugiyama, T., Shinohara, Y., et al.: Time-dependent density functional theory for strong electromagnetic fields in crystalline solids. Phys. Rev. B 85(4), 11 (2012). https://doi.org/10.1103/PhysRevB.85.045134
Andrade, X., et al.: Time-dependent density-functional theory in massively parallel computer architectures: the OCTOPUS project. J. Phys. Condens. Matter 24, 233202 (2012)
Noda, M., Ishimura, K., Nobusada, K., et al.: Massively-parallel electron dynamics calculations in real-time and real-space: toward applications to nanostructures of more than ten-nanometers in size. J. Comput. Phys. 265(14), 145–155 (2014)
Draeger, E.W., Andrade, X., Gunnels, J.A., et al.: Massively parallel first-principles simulation of electron dynamics in materials. In: 2016 IEEE International Parallel and Distributed Processing Symposium, p. 832 (2016)
Barnes, T., Cook, B., Deslippe, J., et al.: Evaluating and optimizing the NERSC workload on Knights Landing. In: Proceedings of the 7th International Workshop on PMBS 2016, pp. 43–53 (2016)
Rosales, C., Cazes, J., Milfeld, K., Gómez-Iglesias, A., Koesterke, L., Huang, L., Vienne, J.: A comparative study of application performance and scalability on the Intel Knights Landing processor. In: Taufer, M., Mohr, B., Kunkel, J.M. (eds.) ISC High Performance 2016. LNCS, vol. 9945, pp. 307–318. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46079-6_22
Joó, B., Kalamkar, D.D., Kurth, T., Vaidyanathan, K., Walden, A.: Optimizing Wilson-Dirac operator and linear solvers for Intel® KNL. In: Taufer, M., Mohr, B., Kunkel, J.M. (eds.) ISC High Performance 2016. LNCS, vol. 9945, pp. 415–427. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46079-6_30
Yount, C., Duran, A.: Effective use of large high-bandwidth memory caches in HPC stencil computation via temporal wave-front tiling. In: Proceedings of the 7th International Workshop on PMBS 2016, pp. 65–75 (2016)
Hofmann, J., Treibig, J., Hager, G., Wellein, G.: Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips. In: Proceedings of WPMVP 2014, pp. 55–64 (2014)
Andreolli, C.: Eight Optimizations for 3-Dimensional Finite Difference (3DFD) Code with an Isotropic (ISO). https://software.intel.com/en-us/articles/eight-optimizations-for-3-dimensional-finite-difference-3dfd-code-with-an-isotropic-iso
Blelloch, G.E.: Prefix Sums and Their Applications, School of Computer Science, Carnegie Mellon University, CMU-CS-90-190, November 1990
Martin, P.J., Ayuso, L.F., Torres, R., Gavilanes, A.: Algorithmic strategies for optimizing the parallel reduction primitive in CUDA. In: 2012 International Conference on High Performance Computing and Simulation, pp. 511–519, July 2012
Sodani, A.: Knights Landing (KNL): 2nd generation Intel Xeon Phi processor. IEEE Hot Chips 27, 1–24 (2015)
Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009)
Hirokawa, Y., Boku, T., Sato, S.A., Yabana, K.: Electron dynamics simulation with time-dependent density functional theory on large scale symmetric mode Xeon Phi cluster. In: The 17th IEEE International Workshop on PDSEC 2016 (2016)
Schultze, M., Ramasesha, K., Pemmaraju, C., et al.: Attosecond band-gap dynamics in Silicon. Science 346(6215), 1348–1352 (2014)
Lucchini, M., Sato, S.A., Ludwig, A., et al.: Attosecond dynamical Franz-Keldysh effect in polycrystalline diamond. Science 353(6302), 916–919 (2016)
Malinauskas, M., Zukauskas, A., Hasegawa, S., et al.: Ultrafast laser processing of materials: from science to industry. Light Sci. Appl. 5, e16133 (2016)
RIKEN AICS. http://www.aics.riken.jp/en/
CCS, University of Tsukuba. http://www.ccs.tsukuba.ac.jp/eng/
Joint Center for Advanced HPC. http://jcahpc.jp/eng/
TOP500. http://www.top500.org/
OCTOPUS. http://octopus-code.org
GitHub: ARTED. https://github.com/ARTED/ARTED
SALMON. http://salmon-tddft.jp/
Acknowledgment
Part of this research was based on the Oakforest-PACS system operated at the JCAHPC in cooperation with the Information Technology Center at the University of Tokyo and the Center for Computational Sciences at the University of Tsukuba. The performance evaluation in this paper that used the COMA system was supported in part by the interdisciplinary collaborative research program at the Center for Computational Sciences, University of Tsukuba. This work was also supported by CREST, JST (Grant No. JPMJCR16N5).
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Hirokawa, Y., Boku, T., Uemoto, M., Sato, S.A., Yabana, K. (2018). Performance Optimization and Evaluation of Scalable Optoelectronics Application on Large Scale KNL Cluster. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science(), vol 10876. Springer, Cham. https://doi.org/10.1007/978-3-319-92040-5_11
DOI: https://doi.org/10.1007/978-3-319-92040-5_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92039-9
Online ISBN: 978-3-319-92040-5
eBook Packages: Computer Science (R0)