
Optimizing LOBPCG: Sparse Matrix Loop and Data Transformations in Action

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10136)

Abstract

Sparse matrix computations are widely used in iterative solvers; they are notoriously memory bound and typically yield poor performance on modern architectures. A common optimization strategy for such computations is to rely on specialized representations that exploit the nonzero structure of the sparse matrix in an application-specific way. Recent research has developed loop and data transformations for sparse matrix computations in a polyhedral compilation framework. In this paper, we apply these and additional loop transformations to a real application code, the LOBPCG solver, which performs a Sparse Matrix Multi-Vector (SpMM) computation at each iteration. The paper presents the transformation derivation for this application code and the resulting performance. The compiler-generated code attains a speedup of up to 8.26× on 8 threads of an Intel Haswell, reaching 30 GFlops, and outperforms a state-of-the-art manually-written Fortran implementation by 3%.
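
For concreteness, the SpMM kernel the abstract refers to can be sketched in a few lines of C. The following is a minimal, illustrative CSR-based SpMM (Y = A·X with k right-hand vectors), not the paper's compiler-generated code; all names (csr_spmm, rowptr, colidx, vals) are ours.

```c
/* Minimal sketch of a Sparse Matrix Multi-Vector (SpMM) kernel:
 * Y = A * X, where A is sparse in CSR format and X holds k dense
 * vectors stored row-major. Illustrative only; names and layout
 * are assumptions, not the paper's generated code. */
#include <stddef.h>

void csr_spmm(size_t n, size_t k,
              const size_t *rowptr,   /* n+1 entries                 */
              const size_t *colidx,   /* nnz column indices          */
              const double *vals,     /* nnz nonzero values          */
              const double *X,        /* n x k input, row-major      */
              double *Y)              /* n x k output, row-major     */
{
    for (size_t i = 0; i < n; i++) {
        for (size_t v = 0; v < k; v++)
            Y[i * k + v] = 0.0;
        for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++) {
            double a = vals[j];
            const double *x = &X[colidx[j] * k];
            /* Each nonzero a is loaded once and reused across all
             * k vectors, unlike k independent SpMV calls. */
            for (size_t v = 0; v < k; v++)
                Y[i * k + v] += a * x[v];
        }
    }
}
```

The inner loop over the k vectors is what distinguishes SpMM from k independent SpMV calls: the matrix entry and the index arrays are loaded once and reused k times, which is the main source of data reuse in SpMM.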


Notes

  1. Both compact and compact-and-pad use variations of the CHiLL compact command; a matrix is provided as an argument for compact-and-pad (a rough illustrative sketch follows below).
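
As a rough illustration of what a compact-and-pad style data transformation produces (hand-written C mimicking the effect, not CHiLL output), the sketch below pads each CSR row to the longest row length, yielding an ELL-style layout whose inner loop has a fixed trip count; padded slots hold explicit zeros so the kernel needs no guard.

```c
/* Illustrative-only sketch of the effect of a compact-and-pad style
 * transformation: CSR rows of varying length are padded to the longest
 * row, producing an ELL-like layout (ell_vals/ell_cols, n x w, row-major).
 * This mimics the transformation's result; it is not CHiLL-generated code.
 * Error checks are omitted for brevity. */
#include <stdlib.h>

void csr_to_ell(size_t n, const size_t *rowptr, const size_t *colidx,
                const double *vals, size_t *w_out,
                double **ell_vals, size_t **ell_cols)
{
    /* Width w = longest row in the CSR matrix. */
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = rowptr[i + 1] - rowptr[i];
        if (len > w) w = len;
    }
    double *ev = calloc(n * w, sizeof *ev);   /* padded values, zero-filled */
    size_t *ec = calloc(n * w, sizeof *ec);   /* padded column indices      */
    for (size_t i = 0; i < n; i++)
        for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++) {
            size_t slot = j - rowptr[i];
            ev[i * w + slot] = vals[j];
            ec[i * w + slot] = colidx[j]; /* padding keeps column 0, value 0 */
        }
    *w_out = w; *ell_vals = ev; *ell_cols = ec;
}
```

Padding trades extra storage and arithmetic on zeros for regular, fixed-length inner loops that are easier to vectorize.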


Author information

Correspondence to Khalid Ahmad.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ahmad, K., Venkat, A., Hall, M. (2017). Optimizing LOBPCG: Sparse Matrix Loop and Data Transformations in Action. In: Ding, C., Criswell, J., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2016. Lecture Notes in Computer Science, vol 10136. Springer, Cham. https://doi.org/10.1007/978-3-319-52709-3_17


  • DOI: https://doi.org/10.1007/978-3-319-52709-3_17


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-52708-6

  • Online ISBN: 978-3-319-52709-3

  • eBook Packages: Computer Science, Computer Science (R0)
