Optimization of the Sparse Matrix-Vector Products of an IDR Krylov Iterative Solver in EMGeo for the Intel KNL Manycore Processor

  • Tareq Malas
  • Thorsten Kurth
  • Jack Deslippe
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9945)


In geophysical imaging, medium properties can be studied by performing scattering experiments using electromagnetic or seismic waves. Quantities such as density, elasticity, and stress can be obtained by fitting the observed measurements to the results predicted by a simulation. The EMGeo software performs these simulations and solves the inverse scattering problem in the Laplace-Fourier domain. In this paper, we focus on the seismic part and the forward step of the inverse scattering problem, which involves inverting a large sparse matrix. For this purpose, EMGeo uses an Induced Dimension Reduction (IDR) Krylov subspace solver. The Sparse Matrix-Vector (SpMV) product is responsible for more than half of the total runtime. Because the SpMV product is memory bandwidth-bound, we use spatial blocking and multiple Right Hand Side (RHS) blocking cache optimizations to increase its arithmetic intensity and thus its performance. Our optimizations achieve \(5.0\times\) and \(4.8\times\) speedups in the SpMV product on Haswell and KNL processors, respectively. We also achieve \(1.8\times\) and \(3.3\times\) speedups in the overall IDR solver on Haswell and KNL processors, respectively. Finally, we give an outlook on possible future optimizations.
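
To illustrate why multiple-RHS blocking raises arithmetic intensity, the following is a minimal sketch and not the EMGeo implementation. It assumes a generic CSR storage layout and real-valued entries, and the function names `spmv_csr` and `spmm_csr_multi_rhs` are hypothetical. In the single-vector kernel every matrix entry is loaded for just one multiply-add; in the blocked kernel each loaded entry is reused across all right-hand sides, so the same matrix traffic supports proportionally more floating-point work.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a CSR matrix A and a single vector x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # One matrix entry is read per multiply-add.
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmm_csr_multi_rhs(row_ptr, col_idx, vals, X):
    """Compute Y = A @ X, where X[j][r] holds entry j of right-hand side r.

    Storing the RHS values of one grid point contiguously lets each
    matrix entry be read once and applied to every right-hand side.
    """
    n_rows = len(row_ptr) - 1
    n_rhs = len(X[0])
    Y = [[0.0] * n_rhs for _ in range(n_rows)]
    for i in range(n_rows):
        y_row = Y[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[k]             # loaded once ...
            x_row = X[col_idx[k]]
            for r in range(n_rhs):  # ... reused for all RHS vectors
                y_row[r] += a * x_row[r]
    return Y


# Small illustrative matrix (an assumption, not from the paper):
# [[ 2, -1,  0],
#  [-1,  2, -1],
#  [ 0, -1,  2]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
vals = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

x0 = [1.0, 2.0, 3.0]
x1 = [1.0, 0.0, 0.0]
X = [[x0[j], x1[j]] for j in range(3)]  # interleave the two RHS vectors

Y = spmm_csr_multi_rhs(row_ptr, col_idx, vals, X)
```

Column `r` of `Y` matches `spmv_csr` applied to the `r`-th vector alone, so the blocked kernel changes only the order of memory accesses, not the result.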


Intel Knights Landing optimization; Matrix vector product optimization; IDR Krylov solver optimization; Multiple right-hand side blocking; Spatial blocking



This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory, Berkeley, USA
