
Optimizing LOBPCG: Sparse Matrix Loop and Data Transformations in Action

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10136)

Abstract

Sparse matrix computations are widely used in iterative solvers; they are notoriously memory bound and typically yield poor performance on modern architectures. A common optimization strategy for such computations is to rely on specialized representations that exploit the nonzero structure of the sparse matrix in an application-specific way. Recent research has developed loop and data transformations for sparse matrix computations in a polyhedral compilation framework. In this paper, we apply these and additional loop transformations to a real application code, the LOBPCG solver, which performs a Sparse Matrix Multi-Vector (SpMM) computation at each iteration. The paper presents the transformation derivation for this application code and the resulting performance. The compiler-generated code attains a speedup of up to 8.26× on 8 threads of an Intel Haswell, reaching 30 GFlops, and outperforms a state-of-the-art manually-written Fortran implementation by 3%.
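
For concreteness, the SpMM kernel the abstract refers to can be sketched in a few lines of C. The following is a minimal, illustrative CSR-based SpMM (Y = A·X with k right-hand vectors), not the paper's compiler-generated code; all names (csr_spmm, rowptr, colidx, vals) are ours.

```c
/* Minimal sketch of a Sparse Matrix Multi-Vector (SpMM) kernel:
 * Y = A * X, where A is sparse in CSR format and X holds k dense
 * vectors stored row-major. Illustrative only; names and layout
 * are assumptions, not the paper's generated code. */
#include <stddef.h>

void csr_spmm(size_t n, size_t k,
              const size_t *rowptr,   /* n+1 entries                 */
              const size_t *colidx,   /* nnz column indices          */
              const double *vals,     /* nnz nonzero values          */
              const double *X,        /* n x k input, row-major      */
              double *Y)              /* n x k output, row-major     */
{
    for (size_t i = 0; i < n; i++) {
        for (size_t v = 0; v < k; v++)
            Y[i * k + v] = 0.0;
        for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++) {
            double a = vals[j];
            const double *x = &X[colidx[j] * k];
            /* Each nonzero a is loaded once and reused across all
             * k vectors, unlike k independent SpMV calls. */
            for (size_t v = 0; v < k; v++)
                Y[i * k + v] += a * x[v];
        }
    }
}
```

The inner loop over the k vectors is what distinguishes SpMM from k independent SpMV calls: the matrix entry and the index arrays are loaded once and reused k times, which is the main source of data reuse in SpMM.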


Notes

  1. Both compact and compact-and-pad use variations of the CHiLL compact command; a matrix is provided as an argument for compact-and-pad (a rough illustrative sketch follows below).
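
As a rough illustration of what a compact-and-pad style data transformation produces (hand-written C mimicking the effect, not CHiLL output), the sketch below pads each CSR row to the longest row length, yielding an ELL-style layout whose inner loop has a fixed trip count; padded slots hold explicit zeros so the kernel needs no guard.

```c
/* Illustrative-only sketch of the effect of a compact-and-pad style
 * transformation: CSR rows of varying length are padded to the longest
 * row, producing an ELL-like layout (ell_vals/ell_cols, n x w, row-major).
 * This mimics the transformation's result; it is not CHiLL-generated code.
 * Error checks are omitted for brevity. */
#include <stdlib.h>

void csr_to_ell(size_t n, const size_t *rowptr, const size_t *colidx,
                const double *vals, size_t *w_out,
                double **ell_vals, size_t **ell_cols)
{
    /* Width w = longest row in the CSR matrix. */
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = rowptr[i + 1] - rowptr[i];
        if (len > w) w = len;
    }
    double *ev = calloc(n * w, sizeof *ev);   /* padded values, zero-filled */
    size_t *ec = calloc(n * w, sizeof *ec);   /* padded column indices      */
    for (size_t i = 0; i < n; i++)
        for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++) {
            size_t slot = j - rowptr[i];
            ev[i * w + slot] = vals[j];
            ec[i * w + slot] = colidx[j]; /* padding keeps column 0, value 0 */
        }
    *w_out = w; *ell_vals = ev; *ell_cols = ec;
}
```

Padding trades extra storage and arithmetic on zeros for regular, fixed-length inner loops that are easier to vectorize.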


Author information

Correspondence to Khalid Ahmad.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ahmad, K., Venkat, A., Hall, M. (2017). Optimizing LOBPCG: Sparse Matrix Loop and Data Transformations in Action. In: Ding, C., Criswell, J., Wu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2016. Lecture Notes in Computer Science, vol 10136. Springer, Cham. https://doi.org/10.1007/978-3-319-52709-3_17


  • DOI: https://doi.org/10.1007/978-3-319-52709-3_17


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-52708-6

  • Online ISBN: 978-3-319-52709-3

  • eBook Packages: Computer Science, Computer Science (R0)
