High Performance LOBPCG Method for Solving Multiple Eigenvalues of Hubbard Model: Efficiency of Communication Avoiding Neumann Expansion Preconditioner
Abstract
The exact diagonalization method is a high accuracy numerical approach for solving the Hubbard model of a system of electrons with strong correlation. The method solves for the eigenvalues and eigenvectors of the Hamiltonian matrix derived from the Hubbard model. Since the Hamiltonian is a huge sparse symmetric matrix, it was expected that the LOBPCG method with an appropriate preconditioner could be used to solve the problem in a short time. This turned out to be the case as the LOBPCG method with a suitable preconditioner succeeded in solving the ground state (the smallest eigenvalue and its corresponding eigenvector) of the Hamiltonian. In order to solve for multiple eigenvalues of the Hamiltonian in a short time, we use a preconditioner based on the Neumann expansion which uses approximate eigenvalues and eigenvectors given by LOBPCG iteration. We apply a communication avoiding strategy, which was developed considering the physical properties of the Hubbard model, to the preconditioner. Our numerical experiments on two parallel computers show that the LOBPCG method coupled with the Neumann preconditioner and the communication avoiding strategy improves convergence and achieves excellent scalability when solving for multiple eigenvalues.
1 Introduction
The convergence of the LOBPCG method depends strongly on the use of a preconditioner. We previously confirmed that the zero-shift point Jacobi preconditioner, which is a shift-and-invert preconditioner [6] using an approximate eigenvalue, gives excellent convergence properties for the Hubbard model with the trapping potential [7]. However, we also reported that the benefit of the preconditioner strongly depends on the characteristics of the nonzero elements of the Hamiltonian and that the preconditioner does not always improve the convergence [8]. Therefore, we proposed a novel preconditioner using the Neumann expansion for solving the ground state of the Hamiltonian and demonstrated that this preconditioner improves convergence for a Hamiltonian that is difficult to solve with the zero-shift point Jacobi preconditioner [8]. Moreover, we applied a communication avoiding strategy, which was developed considering the properties of the Hubbard model, to the preconditioner.
In order to understand strongly correlated electron systems in more detail, in particular their properties at temperatures near absolute zero, we must solve for the several smallest eigenvalues and corresponding eigenvectors of the Hamiltonian. The LOBPCG method can solve for multiple eigenvalues by iterating on a block of vectors.
In this paper, we extend the Neumann expansion preconditioner to the LOBPCG method for solving multiple eigenvalues and corresponding eigenvectors. Moreover, we demonstrate that the preconditioner improves the convergence properties and can achieve excellent parallel performance.
The paper is structured as follows. In Sect. 2 we briefly introduce related work for solving the ground state of the Hubbard model using the LOBPCG method. Section 3 describes the use of the Neumann expansion preconditioner with the communication avoiding strategy for solving for multiple eigenvalues and their corresponding eigenvectors. Section 4 demonstrates the parallel performance of the algorithm on the SGI ICE X and K supercomputers. A summary and conclusions are given in Sect. 5.
2 Related Work
2.1 Hamiltonian-Vector Multiplication
The diagonal element in formula (1) is derived from the repulsive energy \(U_i\) in the corresponding state. The hopping parameter t affects nonzero elements with column-index corresponding to the original state and row-index corresponding to the state after hopping. Since the ratio U / t greatly affects the properties of the model, we have to execute many simulations varying this ratio to reveal the properties of the model.
CAL 1: \(Y^{c}=\bar{D}^{c}\odot V^{c}+A_{\uparrow } V^{c}\),
COM 1: all-to-all communication from \(V^{c}\) to \(V^{r}\),
CAL 2: \(W^{r}= V^{r} A^T_{\downarrow }\),
COM 2: all-to-all communication from \(W^{r}\) to \(W^{c}\),
CAL 3: \(Y^{c}=Y^{c}+W^{c}\).
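The steps CAL 1 to CAL 3 can be sketched on a single process as follows. This is only an illustration under stated assumptions: the vector is stored as an (n_up, n_down) matrix V, the two all-to-all steps become no-ops because there is no data distribution, and the function names, the random test matrices, and the dense Kronecker-product reference are hypothetical helpers introduced here, not part of the original implementation.

```python
import numpy as np

def hubbard_matvec(d_bar, a_up, a_down, v):
    """Single-process sketch of CAL 1 to CAL 3 for V stored as an (n_up, n_down) matrix.

    In the distributed algorithm V^c is split by columns and V^r by rows;
    COM 1 and COM 2 (all-to-all) switch layouts and are no-ops here.
    """
    y = d_bar * v + a_up @ v   # CAL 1: diagonal term plus up-spin hopping
    v_r = v                    # COM 1: V^c -> V^r (no-op on one process)
    w_r = v_r @ a_down.T       # CAL 2: down-spin hopping
    w_c = w_r                  # COM 2: W^r -> W^c (no-op on one process)
    return y + w_c             # CAL 3

def random_symmetric(n, rng):
    """Hypothetical stand-in for a symmetric hopping matrix."""
    m = rng.standard_normal((n, n))
    return (m + m.T) / 2.0

rng = np.random.default_rng(0)
n_up, n_down = 4, 3
a_up = random_symmetric(n_up, rng)
a_down = random_symmetric(n_down, rng)
d_bar = rng.standard_normal((n_up, n_down))
v = rng.standard_normal((n_up, n_down))

# Equivalent dense operator acting on the column-major vec(V):
# vec(A_up V) = (I kron A_up) vec(V), vec(V A_down^T) = (A_down kron I) vec(V)
h = (np.kron(np.eye(n_down), a_up)
     + np.kron(a_down, np.eye(n_up))
     + np.diag(d_bar.flatten(order="F")))
y_ref = (h @ v.flatten(order="F")).reshape((n_up, n_down), order="F")

max_err = np.max(np.abs(hubbard_matvec(d_bar, a_up, a_down, v) - y_ref))
print("max deviation from dense reference:", max_err)
```

Storing the vector as a matrix is what lets the up-spin hopping act on columns and the down-spin hopping act on rows, which in turn is why a layout change (the all-to-all communication) is needed between CAL 1 and CAL 2 in the parallel setting.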
2.2 Preconditioner of LOBPCG Method for Solving the Ground State of Hubbard Model
Zero-Shift Point Jacobi Preconditioner. A suitable preconditioner improves the convergence properties of the LOBPCG method. As a consequence, many preconditioners have been proposed, including preconditioners for the Hamiltonian derived from the Hubbard model. For the Hubbard model, the zero-shift point Jacobi (ZSPJ) preconditioner, which is a shift-and-invert preconditioner using an approximate eigenvalue obtained during LOBPCG iteration, has excellent convergence properties for Hamiltonians where the diagonal elements predominate over the off-diagonal elements, i.e. cases where the repulsive energy U is large [7, 8].
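As a rough illustration of a point Jacobi preconditioner shifted by an approximate eigenvalue, the sketch below assumes the operator has the form \((D-\mu I)^{-1}\), where D is the diagonal of the Hamiltonian and \(\mu\) is the current approximate eigenvalue. The text above does not spell out the formula, so this form, the function name zspj_apply, and the eps safeguard are assumptions made here for illustration.

```python
import numpy as np

def zspj_apply(diag_h, mu, r, eps=1e-12):
    """Apply an assumed shifted point Jacobi preconditioner: w = (D - mu I)^{-1} r.

    diag_h : diagonal elements of the Hamiltonian
    mu     : approximate eigenvalue from the current LOBPCG iteration
    r      : residual vector
    eps    : hypothetical safeguard against division by (near) zero
    """
    shifted = diag_h - mu
    shifted = np.where(np.abs(shifted) < eps, eps, shifted)
    return r / shifted

diag_h = np.array([4.0, 6.0, 10.0])
mu = 2.0
r = np.array([1.0, 2.0, 4.0])
w = zspj_apply(diag_h, mu, r)
print(w)
```

Because only the diagonal is inverted, such an operator is cheap and needs no communication, which is consistent with it working best when the diagonal elements dominate.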
2.3 Communication Avoiding Neumann Expansion Preconditioner for Hubbard Model
CAL 1: \(Y^{c}=\bar{D}^{c}\odot V^{c}+A_{\uparrow } V^{c}\),
COM 1: all-to-all communication from \(V^{c}\) to \(V^{r}\),
CAL 2: \(W^{r}=V^{r}A^{T}_\downarrow \),
COM 2: all-to-all communication from \(W^{r}\) to \(W^{c}\),
CAL 3: \(Y_1^{c}=Y^{c}+W^{c}\),
CAL 4: \(Y^{c}=Y_1^{c}+W^{c}\),
CAL 5: \(Y^{c}=\bar{D}^{c}\odot Y_1^{c}+A_{\uparrow }Y^{c}\),
CAL 6: \(W^{r}=\bar{D}^{r}\odot V^{r}+W^{r}\),
CAL 7: \(W^{r}=W^{r} A^{T}_\downarrow \),
COM 3: all-to-all communication from \(W^{r}\) to \(W^{c}\),
CAL 8: \(Y_2^{c}=Y^{c}+W^{c}\).
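Expanding the eight calculation steps term by term suggests that they return \(Y_1 = HV\) and \(Y_2 = H(HV)\) while using three all-to-all steps, compared with the four that two separate Hamiltonian-vector multiplications would need. The single-process sketch below checks this numerically. The function names and random test data are hypothetical, and the communication steps appear only as comments because nothing is distributed here.

```python
import numpy as np

def matvec(d_bar, a_up, a_down, v):
    """Plain Hamiltonian-vector product H V = D (.) V + A_up V + V A_down^T."""
    return d_bar * v + a_up @ v + v @ a_down.T

def cane_two_products(d_bar, a_up, a_down, v):
    """Steps CAL 1 to CAL 8 on one process; returns (Y_1, Y_2)."""
    y = d_bar * v + a_up @ v       # CAL 1
    # COM 1: V^c -> V^r
    w = v @ a_down.T               # CAL 2
    # COM 2: W^r -> W^c
    y1 = y + w                     # CAL 3: Y_1 = H V
    y = y1 + w                     # CAL 4: extra local work
    y = d_bar * y1 + a_up @ y      # CAL 5
    w = d_bar * v + w              # CAL 6: extra local work
    w = w @ a_down.T               # CAL 7
    # COM 3: W^r -> W^c
    y2 = y + w                     # CAL 8: Y_2 = H (H V)
    return y1, y2

rng = np.random.default_rng(0)
n_up, n_down = 5, 4
a_up = rng.standard_normal((n_up, n_up))
a_up = (a_up + a_up.T) / 2.0
a_down = rng.standard_normal((n_down, n_down))
a_down = (a_down + a_down.T) / 2.0
d_bar = rng.standard_normal((n_up, n_down))
v = rng.standard_normal((n_up, n_down))

y1, y2 = cane_two_products(d_bar, a_up, a_down, v)
hv = matvec(d_bar, a_up, a_down, v)
hhv = matvec(d_bar, a_up, a_down, hv)
print(np.allclose(y1, hv), np.allclose(y2, hhv))
```

The saving comes from reusing \(W = V A^{T}_\downarrow\): CAL 4 and CAL 6 fold it into the operands of the second product so that \(Y_1\) never has to be redistributed, at the price of two extra local additions.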
3 Neumann Expansion Preconditioner for Multiple Eigenvalues of Hubbard Model
3.1 How to Calculate Multiple Eigenvalues Using LOBPCG Method
3.2 Neumann Expansion Preconditioner of LOBPCG Method for Solving Multiple Eigenvalues
4 Performance Result
4.1 Computational Performance and Convergence Property
Table 1. Details of SGI ICE X
Processor  Intel Xeon E5-2680v3 (2.5 GHz, 30 MB L2 cache)
FLOPS per processor  480 GFLOPS
Number of cores per CPU  12
Number of processors per node  2
Memory of node  64 GB
Memory bandwidth  68 GB/s
Network  InfiniBand 6.8 GB/s
Compiler  Intel compiler
Table 2. Elapsed time and number of iterations for convergence of the LOBPCG method using the zero-shift point Jacobi (ZSPJ), Neumann expansion (NE), or communication avoiding Neumann expansion (CANE) preconditioner. Here, s is the number of the Neumann expansion series.
(a) One eigenvalue (the ground state)
Case  No precon.  PJ  ZSPJ  NE \(s=1\)  NE \(s=2\)  NE \(s=3\)  CANE \(s=1\)  CANE \(s=2\)  CANE \(s=3\)
\(U/t=1\), number of iterations  133  133  132  69  59  46  69  59  46
\(U/t=1\), elapsed time (sec)  9.16  9.19  9.13  8.79  11.06  10.78  7.40  9.18  8.94
\(U/t=10\), number of iterations  184  132  124  95  81  65  94  81  64
\(U/t=10\), elapsed time (sec)  13.03  9.08  8.61  12.07  14.65  15.62  10.05  12.62  12.79
(b) 5 eigenvalues
Case  No precon.  PJ  ZSPJ  NE \(s=1\)  NE \(s=2\)  NE \(s=3\)  CANE \(s=1\)  CANE \(s=2\)  CANE \(s=3\)
\(U/t=1\), number of iterations  199  190  168  81  77  59  86  72  54
\(U/t=1\), elapsed time (sec)  171.75  164.57  145.28  89.44  103.31  93.69  87.89  91.21  77.16
\(U/t=10\), number of iterations  293  217  240  159  156  108  155  142  105
\(U/t=10\), elapsed time (sec)  250.77  186.35  204.97  172.90  208.80  171.37  156.41  179.12  149.59
(c) 10 eigenvalues
Case  No precon.  PJ  ZSPJ  NE \(s=1\)  NE \(s=2\)  NE \(s=3\)  CANE \(s=1\)  CANE \(s=2\)  CANE \(s=3\)
\(U/t=1\), number of iterations  551  777  624  319  257  184  340  302  198
\(U/t=1\), elapsed time (sec)  1221.55  1672.69  1369.91  911.56  897.79  759.40  863.57  936.35  680.48
\(U/t=10\), number of iterations  398  298  313  232  184  161  201  177  137
\(U/t=10\), elapsed time (sec)  996.19  740.98  763.42  720.22  704.14  705.03  579.98  607.46  515.60
Table 3. Elapsed time for operations per iteration. This table shows the results using the zero-shift point Jacobi (ZSPJ), Neumann expansion (NE), and communication avoiding Neumann expansion (CANE) preconditioners. Here, the Neumann expansion series s is equal to 1. For \(m=1\), we calculate \(S_B\) instead of executing TSQR; moreover, the ZSPJ preconditioner is calculated together with x, p, X, P.
Elapsed time per iteration (sec)
Operation  \(m=1\) ZSPJ  \(m=1\) NE  \(m=1\) CANE  \(m=5\) ZSPJ  \(m=5\) NE  \(m=5\) CANE  \(m=10\) ZSPJ  \(m=10\) NE  \(m=10\) CANE
Hw (& \(H^2w\))  0.061  0.117  0.100  0.276  0.545  0.448  0.568  1.088  0.909
TSQR  -  -  -  0.407  0.408  0.407  1.498  1.503  1.502
\(S_A\) (& \(S_B\))  0.007  0.007  0.007  0.073  0.073  0.073  0.255  0.257  0.254
x, p, X, P  0.008  0.007  0.007  0.107  0.122  0.121  0.301  0.331  0.332
Preconditioner  -  0.003  0.003  0.018  0.015  0.014  0.035  0.030  0.028
Table 4. Speedup ratio for the elapsed time per iteration using the Neumann expansion preconditioner and communication avoiding strategy.
Case  \(m=1\), \(s=1\)  \(m=1\), \(s=2\)  \(m=1\), \(s=3\)  \(m=5\), \(s=1\)  \(m=5\), \(s=2\)  \(m=5\), \(s=3\)  \(m=10\), \(s=1\)  \(m=10\), \(s=2\)  \(m=10\), \(s=3\)
\(U/t=1\)  1.19  1.20  1.21  1.08  1.06  1.11  1.13  1.13  1.20
\(U/t=10\)  1.19  1.16  1.20  1.08  1.06  1.11  1.08  1.12  1.16
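The speedup entries can be cross-checked against the iteration counts and elapsed times reported earlier for NE and CANE. The sketch below assumes the ratio is defined as the NE elapsed time per iteration divided by the CANE elapsed time per iteration; the text does not state the formula explicitly, but under this assumption the three sampled \(U/t=1\) entries (1.19, 1.08, and 1.20) are reproduced. The function names are hypothetical.

```python
def time_per_iteration(elapsed_sec, iterations):
    """Average elapsed time of one LOBPCG iteration."""
    return elapsed_sec / iterations

def cane_speedup(ne_time, ne_iters, cane_time, cane_iters):
    """Assumed definition: NE time per iteration over CANE time per iteration."""
    return (time_per_iteration(ne_time, ne_iters)
            / time_per_iteration(cane_time, cane_iters))

# Elapsed times (sec) and iteration counts for U/t = 1, copied from the
# iteration and elapsed-time table for the NE and CANE preconditioners.
s_m1 = cane_speedup(8.79, 69, 7.40, 69)          # one eigenvalue, s = 1
s_m5 = cane_speedup(89.44, 81, 87.89, 86)        # 5 eigenvalues, s = 1
s_m10 = cane_speedup(759.40, 184, 680.48, 198)   # 10 eigenvalues, s = 3

print(round(s_m1, 2), round(s_m5, 2), round(s_m10, 2))
```

Normalizing by the iteration count matters because NE and CANE do not always converge in the same number of iterations, so comparing total elapsed times alone would mix the per-iteration cost with convergence effects.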
Next, we discuss the results for \(U/t=10\). The results indicate that the PJ preconditioner improves the convergence properties. ZSPJ also improves convergence for small m; however, when solving for multiple eigenvalues, its convergence properties are almost the same as those of the PJ preconditioner. When we solve for multiple eigenvalues using the Neumann expansion preconditioner, the solution is obtained faster than with the PJ or ZSPJ preconditioners. Moreover, as the Neumann expansion series s increases, the Neumann expansion preconditioner improves the convergence properties and the total elapsed time decreases, especially when m is large.
Table 5. Details of K computer
Processor  SPARC64 VIIIfx (2 GHz, 6 MB L2 cache)
FLOPS per processor  128 GFLOPS
Number of cores per CPU  8
Number of processors per node  1
Memory of node  16 GB
Memory bandwidth  64 GB/s
Network  Torus network (Tofu) 5 GB/s
Compiler  Fujitsu compiler
4.2 Parallel Performance
In order to examine the parallel performance of the LOBPCG method using the Neumann expansion preconditioner, we solved for the 10 smallest eigenvalues and corresponding eigenvectors of the Hamiltonian derived from the 4 \(\times \) 5-site Hubbard model for \(U/t=1\) with 6 up-spin and 6 down-spin electrons. We used the LOBPCG method with ZSPJ, NE, and CANE preconditioners using hybrid parallelization on SGI ICE X in JAEA and the K computer in RIKEN (see Table 5). The results are shown in Table 6. The results indicate that all preconditioners achieve excellent parallel efficiency. The communication avoiding strategy on SGI ICE X decreases the elapsed time per iteration by about 15%. On the other hand, the communication avoiding strategy on the K computer did not realize speedup when using a small number of cores. The ratio of the network bandwidth to FLOPS per node of the K computer is larger than that of SGI ICE X, so it is possible that the cost of the extra calculations (CAL 4 & CAL 6) is larger than that of the all-to-all communication operation. However, since the cost of the all-to-all communication operation increases as the number of cores increases, the strategy realizes speedup on 4096 cores. Therefore, the strategy has the potential to yield speedup for parallel computing using a sufficiently large number of cores, even if the ratio of the network bandwidth to FLOPS is large.
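The comparison of network bandwidth to FLOPS per node can be made concrete with the machine specifications listed for the two systems. The sketch below assumes that the quoted network figures (6.8 GB/s and 5 GB/s) are per node and that the node FLOPS is the per-processor FLOPS times the number of processors per node; the function name is hypothetical.

```python
def network_bytes_per_flop(network_gb_per_s, gflops_per_processor, processors_per_node):
    """Ratio of network bandwidth (GB/s) to peak node performance (GFLOPS)."""
    node_gflops = gflops_per_processor * processors_per_node
    return network_gb_per_s / node_gflops

# Figures copied from the machine specification tables
ice_x = network_bytes_per_flop(6.8, 480.0, 2)    # SGI ICE X: 2 processors per node
k_comp = network_bytes_per_flop(5.0, 128.0, 1)   # K computer: 1 processor per node
ratio = k_comp / ice_x

print(f"SGI ICE X : {ice_x:.4f} bytes/FLOP")
print(f"K computer: {k_comp:.4f} bytes/FLOP")
print(f"K / ICE X : {ratio:.2f}")
```

Under these assumptions the K computer has roughly five and a half times more network bandwidth per unit of peak compute, so trading communication for extra arithmetic is less attractive there until the all-to-all cost grows with the core count.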
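Table 6. Parallel performance of the LOBPCG method on SGI ICE X and the K computer. This table shows the number of iterations, the total elapsed time, and the elapsed time per iteration of the LOBPCG method using the zero-shift point Jacobi (ZSPJ), Neumann expansion (NE), or communication avoiding Neumann expansion (CANE) preconditioner. Here, the Neumann expansion series s is 3.
(a) SGI ICE X
Configuration and quantity  ZSPJ  NE  CANE
64 MPI \(\times \) 12 OpenMP, number of iterations  591  226  225
64 MPI \(\times \) 12 OpenMP, elapsed time (sec)  9501.694  5886.533  4921.302
64 MPI \(\times \) 12 OpenMP, elapsed time per iteration (sec)  16.077  26.047  21.872
128 MPI \(\times \) 12 OpenMP, number of iterations  605  246  229
128 MPI \(\times \) 12 OpenMP, elapsed time (sec)  4611.478  3662.846  2909.048
128 MPI \(\times \) 12 OpenMP, elapsed time per iteration (sec)  7.622  14.890  12.703
256 MPI \(\times \) 12 OpenMP, number of iterations  601  244  226
256 MPI \(\times \) 12 OpenMP, elapsed time (sec)  2259.070  2043.231  1603.456
256 MPI \(\times \) 12 OpenMP, elapsed time per iteration (sec)  3.759  8.374  7.095
(b) K computer
Configuration and quantity  ZSPJ  NE  CANE
128 MPI \(\times \) 8 OpenMP, number of iterations  503  209  230
128 MPI \(\times \) 8 OpenMP, elapsed time (sec)  5775.971  3752.884  4596.063
128 MPI \(\times \) 8 OpenMP, elapsed time per iteration (sec)  11.483  17.956  19.983
256 MPI \(\times \) 8 OpenMP, number of iterations  551  224  303
256 MPI \(\times \) 8 OpenMP, elapsed time (sec)  3231.566  2085.268  2974.883
256 MPI \(\times \) 8 OpenMP, elapsed time per iteration (sec)  5.865  9.309  9.818
512 MPI \(\times \) 8 OpenMP, number of iterations  862  243  250
512 MPI \(\times \) 8 OpenMP, elapsed time (sec)  2548.534  1327.093  1130.652
512 MPI \(\times \) 8 OpenMP, elapsed time per iteration (sec)  2.957  5.461  4.523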
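The per-iteration times in the table above can be turned into the relative change that the communication avoiding strategy brings over plain NE on each machine. The short sketch below does this arithmetic; the dictionary layout and function name are hypothetical, and the numbers are copied from the table.

```python
def relative_reduction(t_ne, t_cane):
    """Fraction by which CANE shortens the NE time per iteration (negative = slower)."""
    return 1.0 - t_cane / t_ne

# MPI processes -> (NE, CANE) elapsed time per iteration in seconds
ice_x = {64: (26.047, 21.872), 128: (14.890, 12.703), 256: (8.374, 7.095)}
k_computer = {128: (17.956, 19.983), 256: (9.309, 9.818), 512: (5.461, 4.523)}

ice_x_reduction = {p: relative_reduction(*t) for p, t in ice_x.items()}
k_reduction = {p: relative_reduction(*t) for p, t in k_computer.items()}

for name, table in (("SGI ICE X", ice_x_reduction), ("K computer", k_reduction)):
    for procs, red in table.items():
        print(f"{name:10s} {procs:4d} MPI: {100.0 * red:+.1f}%")
```

On SGI ICE X the reduction stays close to 15% at every process count, whereas on the K computer the sign changes between 256 and 512 MPI processes (2048 and 4096 cores), which matches the crossover described in the text.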
5 Conclusions
In this paper, we applied the Neumann expansion preconditioner to the LOBPCG method to solve for multiple eigenvalues and corresponding eigenvectors of the Hamiltonian derived from the Hubbard model. We examined the convergence properties and parallel performance of the algorithms. Since the norm of the matrix used in the Neumann expansion should be less than 1, we transform it using approximate eigenvalues calculated by the LOBPCG iteration and the upper bounds of the eigenvalues given by the Gershgorin circle theorem. Moreover, we orthogonalize the iteration vectors in an order that removes, from the preconditioned vectors, the components of the eigenvectors whose corresponding eigenvalues have absolute values greater than or equal to 1.
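To illustrate why the norm condition and the Gershgorin bound matter, the following sketch builds a truncated Neumann series for \((H-\mu I)^{-1}\). It is not the transformation used in this paper: it assumes a shift \(\mu\) strictly below the smallest eigenvalue, and scales by \(c = g - \mu\), where g is the Gershgorin upper bound, so that \(M = I - (H-\mu I)/c\) has eigenvalues in [0, 1). All names and the small random test matrix are hypothetical.

```python
import numpy as np

def gershgorin_upper_bound(h):
    """Upper bound on the eigenvalues: max_i (h_ii + sum_{j != i} |h_ij|)."""
    diag = np.diag(h)
    radii = np.sum(np.abs(h), axis=1) - np.abs(diag)
    return np.max(diag + radii)

def neumann_apply(h, mu, s, r):
    """Approximate (H - mu I)^{-1} r by (1/c) * sum_{k=0}^{s} M^k r,
    with M = I - (H - mu I)/c and c = gershgorin_upper_bound(H) - mu."""
    c = gershgorin_upper_bound(h) - mu
    term = r.copy()
    total = r.copy()
    for _ in range(s):
        term = term - (h @ term - mu * term) / c   # term <- M term
        total = total + term
    return total / c

rng = np.random.default_rng(1)
b = rng.standard_normal((6, 6))
h = (b + b.T) / 2.0 + np.diag(np.arange(6.0))   # small symmetric test matrix
eigs = np.linalg.eigvalsh(h)
mu = eigs[0] - 0.5                              # assumed shift below the spectrum
r = rng.standard_normal(6)

exact = np.linalg.solve(h - mu * np.eye(6), r)
errors = [np.linalg.norm(neumann_apply(h, mu, s, r) - exact) for s in (1, 2, 3)]
print(errors)
```

In this toy setting the error shrinks as s grows because every eigenvalue of M is below 1. When the shift is an approximate eigenvalue inside the spectrum, some eigenvalues of the transformed matrix can reach absolute value 1 or more, which is the situation the orthogonalization order described above is meant to handle.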
The Neumann expansion preconditioner with the communication avoiding strategy can achieve speedup even for problems for which conventional preconditioners yield little improvement. Furthermore, a numerical experiment indicated that the LOBPCG method using this preconditioner has excellent parallel efficiency on thousands of cores, and that the communication avoiding strategy based on the property of the Hubbard model realizes speedup on parallel computers if a sufficiently large number of cores is used. Therefore, we confirm that the preconditioner based on the Neumann expansion is suitable for solving the eigenvalue problem derived from the Hubbard model using the LOBPCG method.
Notes
Acknowledgments
Computations in this study were performed on the SGI ICE X at the JAEA and the K computer at RIKEN Advanced Institute for Computational Science (project ID:ra000005). This research was partially supported by JSPS KAKENHI Grant Number 15K00178.
References
1. Rasetti, M. (ed.): The Hubbard Model: Recent Results. World Scientific, Singapore (1991)
2. Montorsi, A. (ed.): The Hubbard Model. World Scientific, Singapore (1992)
3. Cullum, J.K., Willoughby, R.A.: Lanczos Algorithms for Large Symmetric Eigenvalue Computations, vol. 1: Theory. SIAM, Philadelphia (2002)
4. Knyazev, A.V.: Preconditioned eigensolvers - an oxymoron? Electron. Trans. Numer. Anal. 7, 104–123 (1998)
5. Knyazev, A.V.: Toward the optimal eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput. 23, 517–541 (2001)
6. Saad, Y.: Numerical Methods for Large Eigenvalue Problems: Revised Edition. SIAM (2011)
7. Yamada, S., Imamura, T., Machida, M.: 16.447 TFlops and 159-Billion-dimensional exact-diagonalization for trapped Fermion-Hubbard Model on the Earth Simulator. In: Proceedings of SC 2005 (2005)
8. Yamada, S., Imamura, T., Machida, M.: Communication avoiding Neumann expansion preconditioner for LOBPCG method: convergence property of exact diagonalization method for Hubbard model. In: Proceedings of ParCo 2017 (2017, accepted)
9. Barrett, R., et al.: Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia (1994)
10. Langou, J.: AllReduce algorithms: application to Householder QR factorization. In: Proceedings of the 2007 International Conference on Preconditioning Techniques for Large Sparse Matrix Problems in Scientific and Industrial Applications, pp. 103–106 (2007)
11. Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-avoiding parallel and sequential QR factorizations. Technical report, Electrical Engineering and Computer Sciences, University of California Berkeley (2008)
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.