Tuning HipGISAXS on Multi- and Many-Core Supercomputers
With the continual evolution of multi- and many-core architectures, application codes require architecture-specific tuning to achieve computational performance and energy efficiency close to the theoretical peaks of these architectures. In this paper, we present the optimization and tuning of HipGISAXS, a parallel X-ray scattering simulation code, on several massively parallel, state-of-the-art supercomputers based on multi- and many-core processors. In particular, we target clusters of general-purpose multi-core processors, such as Intel Sandy Bridge and AMD Magny-Cours, as well as many-core accelerators, such as Nvidia Kepler GPUs and Intel Xeon Phi coprocessors. We present both high-level algorithmic and low-level architecture-aware optimization and tuning methodologies for these platforms, and we provide a detailed performance study of our codes on single and multiple nodes of several current top-ranking supercomputers. Additionally, we implement autotuning of many of the algorithmic and optimization parameters, dynamically selecting their optimal values to ensure high performance and efficiency.
Keywords: Thread Block · Many Integrated Core · Strong Scaling · OpenMP Thread · Kernel Fusion