Optimizing Excited-State Electronic-Structure Codes for Intel Knights Landing: A Case Study on the BerkeleyGW Software
We profile and optimize calculations performed with the BerkeleyGW [2, 3] code on the Xeon Phi architecture. BerkeleyGW depends both on hand-tuned critical kernels and on BLAS and FFT libraries. We describe the optimization process and the performance improvements achieved. We discuss a layered parallelization strategy that takes advantage of vector, thread, and node-level parallelism. We discuss locality changes (including the consequences of the lack of an L3 cache) and effective use of the on-package high-bandwidth memory. We show preliminary results on Knights Landing, including a roofline study of code performance before and after a number of optimizations. We find that the GW method is particularly well suited to many-core architectures because it exposes a large amount of parallelism over plane-wave components, band pairs, and frequencies.
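For readers unfamiliar with the roofline model mentioned in the abstract, the following is a minimal sketch of the basic bound it rests on: attainable performance is the smaller of the peak compute rate and the product of arithmetic intensity and memory bandwidth. The function name and all machine numbers below are illustrative assumptions, not measured Knights Landing figures and not values taken from the paper.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Basic roofline bound: min(peak compute, AI * memory bandwidth).

    arithmetic_intensity: FLOPs per byte moved to/from memory
    peak_gflops:          peak compute rate in GFLOP/s
    bandwidth_gbs:        sustained memory bandwidth in GB/s
    """
    if arithmetic_intensity < 0:
        raise ValueError("arithmetic intensity must be non-negative")
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


# Hypothetical placeholder machine parameters (assumptions for illustration only).
PEAK_GFLOPS = 2000.0   # assumed peak compute rate
SLOW_MEM_GBS = 90.0    # assumed off-package memory bandwidth
FAST_MEM_GBS = 400.0   # assumed on-package high-bandwidth memory bandwidth

if __name__ == "__main__":
    for ai in (0.5, 2.0, 8.0, 32.0):
        slow = roofline_gflops(ai, PEAK_GFLOPS, SLOW_MEM_GBS)
        fast = roofline_gflops(ai, PEAK_GFLOPS, FAST_MEM_GBS)
        print(f"AI={ai:5.1f} FLOP/byte  slow-memory bound={slow:7.1f}  "
              f"fast-memory bound={fast:7.1f} GFLOP/s")
```

Under these assumed numbers, a kernel with low arithmetic intensity is limited by bandwidth, so placing its data in the faster memory raises its ceiling, while a kernel with high intensity reaches the compute ceiling in either memory.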
Keywords: Many Integrated Core, Quantum Espresso, Math Library, Arithmetic Intensity, Trip Count
Supported by the SciDAC Program on Excited State Phenomena in Energy Materials funded by the U.S. Department of Energy, Office of Basic Energy Sciences and of Advanced Scientific Computing Research, under Contract No. DE-AC02-05CH11231 at Lawrence Berkeley National Laboratory. Derek Vigil-Fowler is supported by NREL’s LDRD Director’s Postdoctoral Fellowship. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
We acknowledge helpful conversations with Mike Greenfield, Paul Kent, David Prendergast and Pierre Carrier.
- 1. CS roofline toolkit. https://bitbucket.org/berkeleylab/cs-roofline-toolkit
- 4. Frigo, M., Johnson, S.G.: FFTW: an adaptive software architecture for the FFT. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 1381–1384. IEEE (1998)
- 5. Giannozzi, P., Baroni, S., Bonini, N., Calandra, M., Car, R., Cavazzoni, C., Ceresoli, D., Chiarotti, G.L., Cococcioni, M., Dabo, I., Dal Corso, A., Fabris, S., Fratesi, G., de Gironcoli, S., Gebauer, R., Gerstmann, U., Gougoussis, C., Kokalj, A., Lazzeri, M., Martin-Samos, L., Marzari, N., Mauri, F., Mazzarello, R., Paolini, S., Pasquarello, A., Paulatto, L., Sbraccia, C., Scandolo, S., Sclauzero, G., Seitsonen, A.P., Smogunov, A., Umari, P., Wentzcovitch, R.M.: J. Phys.: Condens. Matter 21, 395502 (2009). http://dx.doi.org/10.1088/0953-8984/21/39/395502
- 9. Kronik, L., Makmal, A., Tiago, M.L., Alemany, M.M.G., Jain, M., Huang, X., Saad, Y., Chelikowsky, J.R.: PARSEC the pseudopotential algorithm for real-space electronic structure calculations: recent advances and novel applications to nanostructures. Phys. Status Solidi (b) 243(5), 1063–1079 (2006)
- 10. NERSC. http://www.nersc.gov
- 11. NERSC: Cori. http://www.nersc.gov/systems/cori/
- 12. NERSC: Measuring arithmetic intensity. http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity
- 13. Pfrommer, B., Raczkowski, D., Canning, A., Louie, S.G.: PARATEC (PARAllel Total Energy Code), Lawrence Berkeley National Laboratory (with contributions from Mauri, F., Cote, M., Yoon, Y., Pickard, C., Heynes, P.). For more information see www.nersc.gov/projects/paratec
- 14. Raman, K.: Calculating “flop” using Intel Software Development Emulator (Intel SDE), March 2015. https://software.intel.com/en-us/articles/calculating-flop-using-intel-software-development-emulator-intel-sde
- 16. Tal, A.: Intel Software Development Emulator. https://software.intel.com/en-us/articles/intel-software-development-emulator
- 17. Williams, S.: Auto-tuning Performance on Multicore Computers. Ph.D. thesis, EECS Department, University of California, Berkeley, December 2008
- 18. Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for floating-point programs and multicore architectures. Commun. ACM 52(4), April 2009
- 19. Williams, S.: Roofline performance model. http://crd.lbl.gov/departments/computer-science/PAR/research/roofline/