OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms
We investigate the OpenMP parallelization and optimization of two novel data classification algorithms. The new algorithms are based on graph and PDE solution techniques and, in serial mode, provide significant accuracy and performance advantages over traditional data classification algorithms. The methods leverage the Nyström extension to calculate eigenvalues and eigenvectors of the graph Laplacian; this is a self-contained module that can be used in conjunction with other graph-Laplacian-based methods such as spectral clustering. We use performance tools to identify the hotspots and memory access patterns of the serial codes, and use OpenMP to parallelize the most time-consuming parts. Where possible, we also use library routines. We then optimize the OpenMP implementations, detail their performance on traditional supercomputer nodes (in our case a Cray XC30), and test the optimization steps on emerging testbed systems based on Intel’s Knights Corner and Knights Landing processors. We show both performance improvement and strong scaling behavior. A large number of optimization techniques and analyses are necessary before the algorithm reaches almost ideal scaling.
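As a rough illustration of the kind of loop-level OpenMP parallelization described above, the sketch below computes squared Euclidean distances between every data point and a small set of sampled points, a building block one would expect when forming the similarity weights a Nyström-type method needs. The function name `pairwise_sq_dist`, the row-major layout, and the choice of this particular loop as a hotspot are assumptions made for illustration; they are not taken from the authors' code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): fills out[i*m + j] with the
 * squared Euclidean distance between data point i and sampled point j.
 * x is n-by-d and s is m-by-d, both stored row-major. */
void pairwise_sq_dist(const double *x, const double *s, double *out,
                      int n, int m, int d)
{
    assert(n >= 0 && m >= 0 && d >= 0);

    /* Each row of out depends only on one data point, so iterations of
     * the outer loop are independent and can be divided among threads
     * without synchronization. When compiled without OpenMP support the
     * pragma is ignored and the loop runs serially with the same result. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            double dist2 = 0.0;
            for (int k = 0; k < d; ++k) {
                double diff = x[(size_t)i * d + k] - s[(size_t)j * d + k];
                dist2 += diff * diff;
            }
            out[(size_t)i * m + j] = dist2;
        }
    }
}
```

Parallelizing over the outer index keeps each thread writing to its own contiguous rows of `out`, which avoids write conflicts. A static schedule is a reasonable starting point here because every row costs the same amount of work.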
Keywords: Semi-supervised, Unsupervised, Data, Algorithms, OpenMP, Optimization
This work was supported by NSF grants DMS-1417674 and DMS-1045536 and AFOSR MURI grant FA9550-10-1-0569. We would like to thank Dr. Da Kuang for his suggestions on optimizing the serial codes. This work was also supported by the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.