Sparse principal component regression via singular value decomposition approach

Abstract

Principal component regression (PCR) is a two-stage procedure: the first stage performs principal component analysis (PCA), and the second stage builds a regression model whose explanatory variables are the principal components obtained in the first stage. Since PCA is performed using only the explanatory variables, the principal components carry no information about the response variable. To address this problem, we present a one-stage procedure for PCR based on a singular value decomposition approach. Our approach combines two loss functions, a regression loss and a PCA loss from the singular value decomposition, with sparse regularization. The proposed method enables us to obtain principal component loadings that include information about both the explanatory variables and the response variable. An estimation algorithm is developed using the alternating direction method of multipliers. We conduct numerical studies to show the effectiveness of the proposed method.

Introduction

Principal component regression (PCR; Massy 1965; Jolliffe 1982) is widely used in various fields of research, including chemometrics, bioinformatics, and psychology, and has been extensively studied (Chang and Yang 2012; Dicker et al. 2017; Febrero-Bande et al. 2017; Frank and Friedman 1993; Hartnett et al. 1998; Reiss and Ogden 2007; Rosipal et al. 2001; Wang and Abbott 2008). PCR is a two-stage procedure: one first performs principal component analysis (PCA) (Pearson 1901; Jolliffe 2002), and then performs regression in which the explanatory variables are the selected principal components. However, the principal components carry no information about the response variable. Because of this, the prediction accuracy of PCR can be low if the response variable is related to principal components having small eigenvalues.

To address this problem, a one-stage procedure for PCR was proposed in Kawano et al. (2015). This one-stage procedure was developed by combining a regression squared loss function with the sparse PCA (SPCA) loss function in Zou et al. (2006). The estimate of the regression parameter and loading matrix in the PCA is obtained as the minimizer of the combination of two loss functions with sparse regularization. By virtue of sparse regularization, sparse estimates of the parameters can be obtained. Kawano et al. (2015) referred to the one-stage procedure as sparse principal component regression (SPCR). Kawano et al. (2018) also extended SPCR within the framework of generalized linear models. However, it is unclear whether the PCA loss function in Zou et al. (2006) is the best choice for building SPCR, as there exist several formulae for PCA.

This paper proposes a novel formulation for SPCR. As the PCA loss for SPCR, we adopt a loss function based on a singular value decomposition approach (Shen and Huang 2008). Combining this PCA loss with the regression squared loss, together with sparse regularization, we derive an alternative formulation for SPCR. We call the proposed method sparse principal component regression based on a singular value decomposition approach (SPCRsvd). An estimation algorithm for SPCRsvd is developed using the alternating direction method of multipliers (Boyd et al. 2011) and a linearized alternating direction method of multipliers (Li et al. 2014; Wang and Yuan 2012). We show the effectiveness of SPCRsvd through numerical studies. Specifically, the performance of SPCRsvd is shown to be competitive with or better than that of SPCR.

As an alternative approach, partial least squares (PLS) (Wold 1975; Frank and Friedman 1993) is a widely used statistical method that regresses the response variable on composite variables constructed from the explanatory variables using information about the response. In Chun and Keleş (2010), sparse partial least squares (SPLS) was proposed, which enables the removal of irrelevant explanatory variables when constructing the composite variables. PLS and SPLS are similar to SPCR and SPCRsvd in that they use new explanatory variables that relate the response variable to the original explanatory variables. Herein, these methods are compared using simulated data and real data.

The remainder of the paper is organized as follows. In Sect. 2, we review SPCA in Zou et al. (2006) and Shen and Huang (2008), and SPCR in Kawano et al. (2015). We present SPCRsvd in Sect. 3. Section 4 derives two computational algorithms for SPCRsvd and discusses the selection of tuning parameters. Monte Carlo simulations and real data analyses are presented in Sect. 5. Conclusions are given in Sect. 6.

Preliminaries

Sparse principal component analysis

PCA finds a loading matrix that induces a low-dimensional structure in the data. To ease interpretation of the principal component loading matrix, SPCA has been proposed. To date, several formulations of SPCA have been proposed (Bresler et al. 2018; Chen et al. 2020; d’Aspremont et al. 2007; Erichson et al. 2020; Shen and Huang 2008; Vu et al. 2013; Witten et al. 2009; Zou et al. 2006). For an overview of SPCA, we refer the reader to Zou and Xue (2018) and the references therein. In this subsection, we review the two formulations of SPCA in Zou et al. (2006) and Shen and Huang (2008).

Let \(X=({\varvec{x}}_1, \ldots , {\varvec{x}}_n)^\top \) denote an \(n \times p\) data matrix, where n and p are the number of observations and the number of variables, respectively. Without loss of generality, we assume that the columns of the matrix X are centered. In Zou et al. (2006), SPCA was proposed as

$$\begin{aligned}&\min _{A, B} \left\{ \sum _{i=1}^n \Vert {\varvec{x}}_i - A B^\top {\varvec{x}}_i \Vert ^2_2 + \lambda \sum _{j=1}^k \Vert {\varvec{\beta }}_j \Vert ^2_2 + \sum _{j=1}^k \lambda _{1,j} \Vert {\varvec{\beta }}_j \Vert _1 \right\} \nonumber \\&\quad \ \mathrm{subject \ to} \ \ A^\top A = I_{k}, \end{aligned}$$
(1)

where \(A=({\varvec{\alpha }}_1,\ldots ,{\varvec{\alpha }}_k)\) and \(B=({\varvec{\beta }}_1,\ldots ,{\varvec{\beta }}_k)\) are \(p \times k\) principal component (PC) loading matrices, k denotes the number of principal components, \(I_k\) is the \(k \times k\) identity matrix, \(\lambda ,\lambda _{1,1},\ldots ,\lambda _{1,k}\) are non-negative regularization parameters, and \(\Vert \cdot \Vert _q\) is the \(L_q\) norm for an arbitrary finite-dimensional vector. This SPCA formulation can be regarded as a least squares approach: the first term performs PCA by least squares. The second and third terms constitute sparse regularization similar to the elastic net (Zou and Hastie 2005), and they enable us to set some of the estimates in B to zero. If \(\lambda =0\), the regularization reduces to the lasso (Tibshirani 1996) with a separate regularization parameter for each loading vector.

A simple calculation shows that Problem (1) is equivalent to

$$\begin{aligned} \min _{A, B} \sum _{j=1}^k \left\{ \Vert X {\varvec{\alpha }}_j - X {\varvec{\beta }}_j \Vert ^2_2 + \lambda \Vert {\varvec{\beta }}_j \Vert ^2_2 + \lambda _{1,j} \Vert {\varvec{\beta }}_j \Vert _1 \right\} \ \ \mathrm{subject \ to} \ \ A^\top A = I_{k}. \end{aligned}$$
(2)

It is straightforward to optimize the parameters A and B in this minimization problem. For fixed A, the SPCA problem (2) reduces to an elastic net problem, so the estimate of B can be obtained by the least angle regression algorithm (Efron et al. 2004) or the coordinate descent algorithm (Friedman et al. 2007; Wu and Lange 2008). For fixed B, an estimate of A can be obtained by solving a reduced rank Procrustes rotation problem (Zou et al. 2006). By alternating these two steps until convergence, we obtain the final estimates \({\hat{A}}\) and \({\hat{B}}\) of A and B, respectively. Note that only \({\hat{B}}\) is used as the principal component loading matrix.
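To make the alternating scheme concrete, the following sketch implements Problem (2) with NumPy. It assumes a common lasso parameter for all components and uses plain coordinate descent in place of the LARS-EN solver of the paper; the function name `spca_zou` and all parameter defaults are illustrative, not taken from the original work.

```python
import numpy as np

def soft_threshold(x, t):
    # Elementwise soft-thresholding operator S(x, t) = sign(x)(|x| - t)_+.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def spca_zou(X, k, lam=1e-3, lam1=0.1, n_iter=100):
    """Alternating minimization for Problem (2), a sketch of the SPCA of
    Zou et al. (2006). `lam` is the ridge parameter; `lam1` is a common
    lasso parameter (the paper allows one per component)."""
    n, p = X.shape
    # Initialize A with the leading k right singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T
    B = A.copy()
    G = X.T @ X
    for _ in range(n_iter):
        # B-step: for fixed A, each column solves an elastic net problem;
        # here a few coordinate-descent sweeps stand in for LARS-EN.
        for j in range(k):
            b = B[:, j]  # view into B, updated in place
            for l in range(p):
                r = G[l] @ A[:, j] - G[l] @ b + G[l, l] * b[l]
                b[l] = soft_threshold(r, lam1 / 2) / (G[l, l] + lam)
        # A-step: reduced rank Procrustes rotation, A = U W^T from the
        # SVD of X^T X B.
        U, _, Wt = np.linalg.svd(G @ B, full_matrices=False)
        A = U @ Wt
    return B  # only B-hat is used as the PC loading matrix
```

The coordinate update follows from minimizing \(\Vert X \varvec{\alpha }_j - X \varvec{\beta }_j \Vert _2^2 + \lambda \Vert \varvec{\beta }_j \Vert _2^2 + \lambda _{1,j} \Vert \varvec{\beta }_j \Vert _1\) over one coordinate at a time.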

Alternatively, Shen and Huang (2008) proposed another formulation of SPCA, which can be regarded as a singular value decomposition (SVD) approach. Consider the low-rank approximation of the data matrix X obtained by the SVD in the form

$$\begin{aligned} X \approx UDV^\top = \sum _{k=1}^r d_k {\varvec{u}}_k {\varvec{v}}_k^\top , \end{aligned}$$
(3)

where \(U=({\varvec{u}}_1,\ldots ,{\varvec{u}}_r)\) is an \(n \times r\) matrix with \(U^\top U = I_r\), \(V=({\varvec{v}}_1,\ldots ,{\varvec{v}}_r)\) is a \(p \times r\) matrix with \(V^\top V = I_r\), \(D = \mathrm{diag} (d_1,\ldots ,d_r)\), and \(r < \min (n,p)\). The singular values are assumed to be ordered such that \(d_1 \ge \cdots \ge d_r \ge 0\). Using the connection between PCA and SVD, Shen and Huang (2008) obtained sparse PC loadings by estimating V with sparse regularization.

To achieve sparseness of V, Shen and Huang (2008) adopted the rank-one approximation procedure. First, the first PC loading vector \(\tilde{\varvec{v}}_1\) is obtained by solving the minimization problem

$$\begin{aligned} \min _{\tilde{\varvec{u}}_1, \tilde{\varvec{v}}_1} \left\{ \Vert X - \tilde{\varvec{u}}_1 \tilde{\varvec{v}}_1^\top \Vert _F^2 + \lambda P (\tilde{\varvec{v}}_1) \right\} \ \ \mathrm{subject \ to} \ \ \Vert \tilde{\varvec{u}}_1 \Vert _2=1. \end{aligned}$$
(4)

Here \(\tilde{\varvec{u}}_1, \tilde{\varvec{v}}_1\) are defined as rescaled vectors such that \(\tilde{\varvec{u}}_1 \tilde{\varvec{v}}_1^\top = d_1 {\varvec{u}}_1 {\varvec{v}}_1^\top \), \(P(\cdot )\) is a penalty function that induces sparsity in \(\tilde{\varvec{v}}_1\), and \(\Vert \cdot \Vert _F\) is the Frobenius norm defined by \(\Vert A \Vert _F = \sqrt{ \mathrm{tr}(A^\top A) }\) for an arbitrary matrix A. As the penalty function, Shen and Huang (2008) used the lasso penalty (Tibshirani 1996), the hard-thresholding penalty (Donoho and Johnstone 1994), or the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li 2001). The rank-one approximation problem (4) is easy to solve; see Algorithm 1 in Shen and Huang (2008). The remaining PC loading vectors are obtained by performing rank-one approximations of the corresponding residual matrices. For example, to derive the second PC loading vector \(\tilde{\varvec{v}}_2\), we solve the minimization problem

$$\begin{aligned} \min _{\tilde{\varvec{u}}_2, \tilde{\varvec{v}}_2} \left\{ \Vert X^\dagger - \tilde{\varvec{u}}_2 \tilde{\varvec{v}}_2^\top \Vert _F^2 + \lambda P (\tilde{\varvec{v}}_2) \right\} \ \ \ \mathrm{subject \ to} \ \ \ \Vert \tilde{\varvec{u}}_2 \Vert _2=1, \end{aligned}$$

where \(X^\dagger = X - \tilde{\varvec{u}}_1 \tilde{\varvec{v}}_1^\top \). The regularization parameter \(\lambda \) is selected by cross-validation.
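The rank-one step (4) with the lasso penalty and the residual-based deflation described above can be sketched as follows. The alternating closed-form updates mirror Algorithm 1 of Shen and Huang (2008); the function names, the fixed iteration count, and the use of one common penalty `lam` for all components are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    # Elementwise soft-thresholding S(x, t) = sign(x)(|x| - t)_+.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def rank_one_sparse_svd(X, lam, n_iter=100):
    """Sketch of one rank-one fit: minimize ||X - u v^T||_F^2 + lam ||v||_1
    subject to ||u||_2 = 1, by alternating closed-form updates."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    u = U[:, 0]  # initialize with the leading left singular vector
    for _ in range(n_iter):
        # For fixed u: v = S(X^T u, lam / 2) elementwise.
        v = soft_threshold(X.T @ u, lam / 2)
        # For fixed v: u = X v / ||X v||_2.
        nrm = np.linalg.norm(X @ v)
        if nrm == 0:
            break
        u = X @ v / nrm
    return u, v

def sparse_svd(X, k, lam):
    # Subsequent loadings come from rank-one fits to residual matrices,
    # e.g. X - u1 v1^T for the second loading.
    R = X.copy()
    loadings = []
    for _ in range(k):
        u, v = rank_one_sparse_svd(R, lam)
        R = R - np.outer(u, v)
        loadings.append(v)
    return np.column_stack(loadings)
```

The v-update follows because, for unit-norm u, the objective in v separates coordinatewise into lasso problems.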

Sparse principal component regression

For a one-dimensional continuous response variable Y and a p-dimensional explanatory variable \(\varvec{x}\), suppose we have obtained a dataset \(\{ (y_i, {\varvec{x}}_i) ; i=1,\ldots ,n \}\). We assume that the response variable is explained by variables composed by PCA of \(X = ({\varvec{x}}_1,\ldots , {\varvec{x}}_n)^\top \). Traditional PCR uses a regression model with a few PC scores corresponding to large eigenvalues. Note that these PC scores are derived by PCA prior to the regression. This two-stage procedure might then fail to predict the response if the response variable is related to PCs corresponding to small eigenvalues.

To attain a one-stage procedure for PCR, the SPCR proposed in Kawano et al. (2015) was formulated as the following minimization problem:

$$\begin{aligned}&\min _{A, B, \gamma _0, {\varvec{\gamma }}} \Bigg \{ \sum _{i=1}^n \left( y_i - \gamma _0 - {\varvec{\gamma }}^\top B^\top {\varvec{x}}_i \right) ^2 + w \sum _{i=1}^n \Vert {\varvec{x}}_i - A B^\top {\varvec{x}}_i \Vert ^2_2 \nonumber \\& + \lambda _{\beta } \xi \sum _{j=1}^k \Vert {\varvec{\beta }}_j \Vert ^2_2 + \lambda _{\beta } (1-\xi ) \sum _{j=1}^k \Vert {\varvec{\beta }}_{j} \Vert _1 + \lambda _{\gamma } \Vert {\varvec{\gamma }} \Vert _1\Bigg \} \\&\mathrm{subject \ to} \ \ \ A^\top A = I_{k}, \nonumber \end{aligned}$$
(5)

where \(\gamma _0\) is an intercept, \({\varvec{\gamma }} = (\gamma _1,\ldots ,\gamma _k)^\top \) is the vector of regression coefficients, \( \lambda _{\beta }\) and \(\lambda _{\gamma }\) are non-negative regularization parameters, w is a positive tuning parameter, and \(\xi \in [0, 1]\) is a tuning parameter. The first term in Formula (5) is the regression squared loss function including the PCs \(B^\top {\varvec{x}}\) as explanatory variables, while the second term is the PCA loss function used in the SPCA of Zou et al. (2006). Sparse regularization in SPCR plays two roles: sparseness and identifiability of the parameters. For identifiability via sparse regularization, we refer the reader to Choi et al. (2010), Jennrich (2006), and Kawano et al. (2015). Kawano et al. (2018) also extended SPCR from the viewpoint of generalized linear models, which can deal with binary, count, and multi-categorical data for the response variable.

SVD-based sparse principal component regression

SPCR uses two basic loss functions: the regression squared loss function and the PCA loss function in Zou et al. (2006). However, it is unclear whether the PCA loss is the best choice for building SPCR. To investigate this issue, we propose another formulation for SPCR using the SVD approach in Shen and Huang (2008).

We consider the following minimization problem:

$$\begin{aligned}&\min _{\beta _0,{\varvec{\beta }}, Z, V} \left\{ \frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \lambda _V \Vert V \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }\Vert _1 \right\} \nonumber \\&\mathrm{subject \ to} \quad V^\top V = I_k, \end{aligned}$$
(6)

where \(\beta _0\) is an intercept, k is the number of PCs, \({\varvec{\beta }}\) is a k-dimensional coefficient vector, Z is an \(n \times k\) matrix of PCs, V is a \(p \times k\) PC loading matrix, and \(\varvec{1}_n\) is an n-dimensional vector of ones. In addition, w is a positive tuning parameter and \(\lambda _V,\lambda _{\varvec{\beta }}\) are non-negative regularization parameters.

The first term is the regression squared loss function relating the response to the PCs XV. The second term is the PCA loss function of the SVD approach in Shen and Huang (2008). Although this formula appears different from the first term in Formula (4), the two are essentially equivalent: we estimate the k PCs simultaneously, whereas Shen and Huang (2008) estimate them sequentially. The third and fourth terms are lasso penalties that induce zero estimates in V and \(\varvec{\beta }\), respectively. The tuning parameter w controls the weight of the second term: a smaller value of w is used when the aim is better prediction accuracy, while a larger value is used when the aim is an exact expression of the PC loadings. The minimization problem (6) allows us to perform regression analysis and PCA simultaneously. We call this method SPCRsvd. In Sect. 5, we will observe through numerical studies that SPCRsvd is competitive with or better than SPCR.

We remark on two points here. First, it is possible to use Z in the first term of (6) instead of XV, since Z also represents the PCs. However, the formulation with Z did not perform well in numerical studies, so we adopt the formulation with XV. Second, SPCR imposes a ridge penalty on the PC loadings, but SPCRsvd does not. The ridge penalty comes from the SPCA of Zou et al. (2006); because SPCRsvd is not based on that formulation, no ridge penalty appears in Formula (6). It is possible to add a ridge penalty and to replace the lasso penalty with other penalties that induce sparsity, e.g., the adaptive lasso penalty, the SCAD penalty, or the minimax concave penalty (Zhang 2010), but our aim in this paper is to establish the basic procedure of Formula (6).
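As a concrete reading of Formula (6), the following sketch evaluates the SPCRsvd criterion for given parameter values. The function name and argument layout are illustrative; the orthogonality constraint \(V^\top V = I_k\) is assumed to hold and is not checked.

```python
import numpy as np

def spcrsvd_objective(y, X, beta0, beta, Z, V, w, lam_V, lam_beta):
    """Value of the SPCRsvd criterion (6) at the given parameters."""
    n = X.shape[0]
    # Regression squared loss relating the response to the PCs XV.
    reg_loss = np.sum((y - beta0 - X @ V @ beta) ** 2) / n
    # SVD-based PCA loss, weighted by the tuning parameter w.
    pca_loss = w * np.linalg.norm(X - Z @ V.T, "fro") ** 2 / n
    # Lasso penalties on the loadings V and the coefficients beta.
    penalty = lam_V * np.abs(V).sum() + lam_beta * np.abs(beta).sum()
    return reg_loss + pca_loss + penalty
```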

Implementation

Computational algorithm

To obtain the estimates of the parameters \({\varvec{\beta }}, Z, V\) in Formula (6), we employ the alternating direction method of multipliers (ADMM) and the linearized alternating direction method of multipliers (LADMM). ADMM and LADMM have recently been used in various models with sparse regularization; see, for example, Boyd et al. (2011); Danaher et al. (2014); Li et al. (2014); Ma and Huang (2017); Price et al. (2019); Tan et al. (2014); Wang et al. (2018); Yan and Bien (2020) and Ye and Xie (2011).

To solve the minimization problem (6) by using ADMM, we rewrite the problem as

$$\begin{aligned}&\min _{\beta _0, {\varvec{\beta }}, {\varvec{\beta }}_0, Z, V, V_0, V_1} \left\{ \frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_1 {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X \right. \nonumber \\&\left. - Z V^\top \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }_0 \Vert _1 \right\} \nonumber \\&\mathrm{subject \ to} \quad V^\top V = I_k, \quad V=V_0=V_1, \quad \varvec{\beta }= \varvec{\beta }_0. \end{aligned}$$
(7)

The scaled augmented Lagrangian for the problem (7) is then given by

$$\begin{aligned}&\frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_1 {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }_0 \Vert _1 \\&+ \frac{\rho _1}{2} \Vert V - V_0 + \Lambda _1 \Vert _F^2 +\frac{\rho _2}{2} \Vert V_1 - V_0 + \Lambda _2 \Vert _F^2 + \frac{\rho _3}{2} \Vert \varvec{\beta }- {\varvec{\beta }}_0 + {\varvec{\lambda }}_3 \Vert _2^2 \\&\mathrm{subject \ to} \quad V^\top V = I_k, \end{aligned}$$

where \(\Lambda _1, \Lambda _2, {\varvec{\lambda }}_3\) are dual variables and \(\rho _1, \rho _2, \rho _3 \ (>0)\) are penalty parameters. This gives rise to the following ADMM algorithm:

  1. Step 1

    Set the values of the tuning parameter w, the regularization parameters \(\lambda _V, \lambda _{\varvec{\beta }}\), and the penalty parameters \(\rho _1, \rho _2, \rho _3\).

  2. Step 2

    Initialize all the parameters as \(\beta _0^{(0)}, {\varvec{\beta }}^{(0)}, {\varvec{\beta }}_0^{(0)}, Z^{(0)}, V^{(0)}, V_0^{(0)}, V_1^{(0)},\Lambda _1^{(0)}, \) \(\Lambda _2^{(0)}, {\varvec{\lambda }}_3^{(0)}\).

  3. Step 3

    For \(m=0,1,2,\ldots \), repeat from Steps 4 to 11 until convergence.

  4. Step 4

    Update \(V_1\) as follows:

    $$\begin{aligned} \mathrm{vec} (V_1^{(m+1)})&= \left( \frac{1}{n} {\varvec{\beta }}^{(m)} {\varvec{\beta }}^{(m)\top } \otimes X^\top X + \frac{\rho _2}{2} I_k \otimes I_p \right) ^{-1} \\&\mathrm{vec} \bigg \{ \frac{1}{n} X^\top ({\varvec{y}} - \beta _0^{(m)} {\varvec{1}}_n) {\varvec{\beta }}^{(m)\top } \\& + \frac{\rho _2}{2} (V_0^{(m)} - \Lambda _2^{(m)}) \bigg \}, \end{aligned}$$

    where \(\otimes \) represents the Kronecker product.

  5. Step 5

    Update V as follows:

    $$\begin{aligned} V^{(m+1)}=PQ^\top , \end{aligned}$$

    where P and Q are the matrices given by the SVD

    $$\begin{aligned} \displaystyle {\frac{w}{n} X^\top Z^{(m)} + \frac{\rho _1}{2} \left( V_0^{(m)} - \Lambda _1^{(m)} \right) = P \varOmega Q^\top }. \end{aligned}$$
  6. Step 6

    Update \(V_0\) as follows:

    $$\begin{aligned}&v_{0ij}^{(m+1)} = {{\mathcal {S}}} \left( \frac{\rho _1(v_{ij}^{(m+1)} + \lambda _{1ij}^{(m)}) +\rho _2(v_{1ij}^{(m+1)} + \lambda _{2ij}^{(m)})}{\rho _1 + \rho _2}, \frac{\lambda _V}{\rho _1 + \rho _2} \right) , \\&i=1,\ldots ,p, \ j=1,\ldots ,k, \end{aligned}$$

    where \(v_{0ij}^{(m)}=(V_0^{(m)})_{ij}\), \(v_{ij}^{(m)}=(V^{(m)})_{ij}\), \(v_{1ij}^{(m)}=(V_1^{(m)})_{ij}\), \(\lambda _{\ell ij} \ (\ell =1,2)\) is the (i, j)-th element of the matrix \(\Lambda _\ell \ (\ell =1,2)\), and \({{\mathcal {S}}} (\cdot ,\cdot ) \) is the soft-thresholding operator defined by \({{\mathcal {S}}}(x,\lambda )=\mathrm{sign}(x)(|x|-\lambda )_+\).

  7. Step 7

    Update Z by \(Z^{(m+1)}=X V^{(m+1)}\).

  8. Step 8

    Update \({\varvec{\beta }}\) as follows:

    $$\begin{aligned} \varvec{\beta }^{(m+1)}&= \left( \frac{1}{n} V_1^{(m+1)\top } X^\top X V_1^{(m+1)} + \frac{\rho _3}{2} I_k \right) ^{-1} \bigg \{ \frac{1}{n} V_1^{(m+1)\top } X^\top ({\varvec{y}} - \beta _0^{(m)} {\varvec{1}}_n) \\&\quad + \frac{\rho _3}{2} ({\varvec{\beta }}_0^{(m)} - {\varvec{\lambda }}^{(m)}_3) \bigg \}. \end{aligned}$$
  9. Step 9

    Update \({\varvec{\beta }}_0\) as follows:

    $$\begin{aligned} \beta _{0j}^{(m+1)} = {{\mathcal {S}}} \left( \beta _j^{(m+1)} + \lambda _{3j}^{(m)}, \frac{\lambda _\beta }{\rho _3} \right) , \quad j=1,\ldots ,k, \end{aligned}$$

    where \(\lambda _{3j}^{(m)}\) and \(\beta _j^{(m)}\) are the j-th elements of the vectors \({\varvec{\lambda }}_3^{(m)}\) and \({\varvec{\beta }}^{(m)}\), respectively.

  10. Step 10

    Update \(\beta _0\) as follows:

    $$\begin{aligned} \beta _{0}^{(m+1)} = \frac{1}{n} {\varvec{1}}_n^\top ( {\varvec{y}} - X V_1^{(m+1)} {\varvec{\beta }}^{(m+1)} ). \end{aligned}$$
  11. Step 11

    Update \({\Lambda }_1,{\Lambda }_2,{\varvec{\lambda }}_3\) as follows:

    $$\begin{aligned} \Lambda _1^{(m+1)}&= \Lambda _1^{(m)} + V^{(m+1)} - V_0^{(m+1)},\\ \Lambda _2^{(m+1)}&= \Lambda _2^{(m)} + V_1^{(m+1)} - V_0^{(m+1)},\\ {\varvec{\lambda }}_3^{(m+1)}&= {\varvec{\lambda }}_3^{(m)} + {\varvec{\beta }}^{(m+1)} - {\varvec{\beta }}^{(m+1)}_0. \end{aligned}$$

The derivations of the updates are given in “Appendix A”.
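A compact NumPy sketch of Steps 4-11 for fixed tuning parameters is given below. The initial values, the fixed iteration count, and the absence of a convergence check are simplifying assumptions, so this illustrates the updates rather than providing a full implementation. The vectorized system in Step 4 uses the column-stacking identity \(\mathrm{vec}(AMB) = (B^\top \otimes A)\,\mathrm{vec}(M)\), hence the Fortran-order reshapes.

```python
import numpy as np

def soft_threshold(x, t):
    # Elementwise soft-thresholding S(x, t) = sign(x)(|x| - t)_+.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def spcrsvd_admm(y, X, k, w, lam_V, lam_beta,
                 rho1=1.0, rho2=1.0, rho3=1.0, n_iter=200):
    n, p = X.shape
    XtX = X.T @ X
    # Step 2: initialize V, V0, V1 with the leading right singular vectors.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt[:k].T.copy()
    V0, V1 = V.copy(), V.copy()
    Z = X @ V
    beta, beta_aux = np.zeros(k), np.zeros(k)  # beta_aux is the auxiliary vector
    beta0 = y.mean()
    L1, L2, l3 = np.zeros((p, k)), np.zeros((p, k)), np.zeros(k)
    for _ in range(n_iter):
        # Step 4: update V1 by solving the vectorized linear system.
        A = np.kron(np.outer(beta, beta), XtX) / n + (rho2 / 2) * np.eye(p * k)
        R = np.outer(X.T @ (y - beta0), beta) / n + (rho2 / 2) * (V0 - L2)
        V1 = np.linalg.solve(A, R.reshape(-1, order="F")).reshape(p, k, order="F")
        # Step 5: update V by the orthogonal Procrustes rotation V = P Q^T.
        P, _, Qt = np.linalg.svd((w / n) * X.T @ Z + (rho1 / 2) * (V0 - L1),
                                 full_matrices=False)
        V = P @ Qt
        # Step 6: update V0 elementwise by soft-thresholding.
        V0 = soft_threshold((rho1 * (V + L1) + rho2 * (V1 + L2)) / (rho1 + rho2),
                            lam_V / (rho1 + rho2))
        # Step 7: update the PC scores.
        Z = X @ V
        # Step 8: update beta via a k x k ridge-type system.
        G = V1.T @ XtX @ V1 / n + (rho3 / 2) * np.eye(k)
        h = V1.T @ X.T @ (y - beta0) / n + (rho3 / 2) * (beta_aux - l3)
        beta = np.linalg.solve(G, h)
        # Step 9: update the auxiliary coefficient vector by soft-thresholding.
        beta_aux = soft_threshold(beta + l3, lam_beta / rho3)
        # Step 10: update the intercept.
        beta0 = np.mean(y - X @ V1 @ beta)
        # Step 11: dual updates.
        L1 = L1 + V - V0
        L2 = L2 + V1 - V0
        l3 = l3 + beta - beta_aux
    return beta0, beta, V0, Z
```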

To apply LADMM to the minimization problem (6), we consider the following problem:

$$\begin{aligned}&\min _{\beta _0, {\varvec{\beta }}, {\varvec{\beta }}_0, Z, V, V_0} \left\{ \frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_0 {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }_0 \Vert _1 \right\} \nonumber \\&\mathrm{subject \ to} \quad V^\top V = I_k, \quad V=V_0, \quad \varvec{\beta }= \varvec{\beta }_0. \end{aligned}$$
(8)

The augmented Lagrangian for this problem is given by

$$\begin{aligned}&\frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_0 {\varvec{\beta }} \Vert _2^2 + \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 + \lambda _{\varvec{\beta }} \Vert \varvec{\beta }_0 \Vert _1 \\&+ \frac{\rho _1}{2} \Vert V_0 - V + \Lambda \Vert _F^2 + \frac{\rho _2}{2} \Vert \varvec{\beta }- {\varvec{\beta }}_0 + {\varvec{\lambda }} \Vert _2^2 \\&\mathrm{subject \ to} \quad V^\top V = I_k, \end{aligned}$$

where \(\Lambda , {\varvec{\lambda }}\) are dual variables and \(\rho _1, \rho _2 \ (>0)\) are penalty parameters.

The updates of the LADMM algorithm are almost the same as those of the ADMM algorithm. We summarize the updates and the derivations in “Appendix B”.

Here we remark on the main differences between ADMM and LADMM. LADMM has two penalty parameters (\(\rho _1,\rho _2\)), while ADMM has three (\(\rho _1,\rho _2,\rho _3\)), so LADMM has one fewer tuning parameter than ADMM. This is an advantage of LADMM regardless of whether the user tunes the penalty parameters subjectively or objectively. On the other hand, LADMM uses an approximation based on a Taylor expansion; if this approximation is inappropriate, LADMM may fail to estimate the parameters. In terms of running time, ADMM appears to be faster than LADMM, based on several numerical studies. These results will be presented in Sect. 6 when discussing the limitations of the current study.

Determination of tuning parameters

We have six tuning parameters: \(w, \lambda _V, \lambda _{\varvec{\beta }}, \rho _1, \rho _2, \rho _3\). The penalty parameters are fixed as \(\rho _1=\rho _2=\rho _3=1\) in accordance with Boyd et al. (2011). The tuning parameter w is set according to the purpose of the analysis: a small value is allocated to w when the user considers the regression loss to be more important than the PCA loss. This idea follows Kawano et al. (2015, 2018).

The two regularization parameters \(\lambda _V, \lambda _{\varvec{\beta }}\) are objectively selected by K-fold cross-validation. For the original dataset divided into the K datasets \(({\varvec{y}}^{(1)}, X^{(1)}), \ldots , ({\varvec{y}}^{(K)}, X^{(K)})\), the criterion for the K-fold cross-validation in ADMM is given by

$$\begin{aligned} \mathrm{CV} = \frac{1}{K} \sum _{k=1}^K \frac{1}{n} \left\| {\varvec{y}}^{(k)} - {\hat{\beta }}_0^{(-k)} {\varvec{1}}_{(k)} - X^{(k)} {\hat{V}}_1^{(-k)} \hat{\varvec{\beta }}^{(-k)} \right\| _2^2, \end{aligned}$$
(9)

where \({\hat{\beta }}_0^{(-k)}, {\hat{V}}_1^{(-k)}, \hat{\varvec{\beta }}^{(-k)}\) are the estimates of \({\beta }_0, {V}_1, {\varvec{\beta }}\), respectively, computed with the data excluding the k-th dataset. We omit the CV criterion for LADMM, since it is obtained simply by replacing \({\hat{V}}_1^{(-k)}\) in (9) with \({\hat{V}}_0^{(-k)}\).

We choose the values of the regularization parameters \(\lambda _V, \lambda _{\varvec{\beta }}\) from the minimizers of CV in (9).
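The grid search over \((\lambda _V, \lambda _{\varvec{\beta }})\) by the criterion (9) can be sketched as follows, where `fit` is a placeholder for the ADMM estimator returning \(({\hat{\beta }}_0, {\hat{V}}_1, \hat{\varvec{\beta }})\), and the per-fold mean squared prediction error is averaged over the K folds.

```python
import numpy as np

def cv_select(y, X, fit, lam_V_grid, lam_beta_grid, K=5, seed=0):
    """Select (lam_V, lam_beta) minimizing the K-fold CV criterion (9).
    `fit(y, X, lam_V, lam_beta)` must return (beta0_hat, V1_hat, beta_hat)."""
    n = len(y)
    fold = np.random.default_rng(seed).permutation(n) % K  # fold labels 0..K-1
    best = (np.inf, None, None)
    for lam_V in lam_V_grid:
        for lam_beta in lam_beta_grid:
            cv = 0.0
            for j in range(K):
                tr, te = fold != j, fold == j
                beta0_hat, V1_hat, beta_hat = fit(y[tr], X[tr], lam_V, lam_beta)
                resid = y[te] - beta0_hat - X[te] @ V1_hat @ beta_hat
                cv += np.mean(resid ** 2) / K  # per-fold prediction error
            if cv < best[0]:
                best = (cv, lam_V, lam_beta)
    return best  # (CV value, selected lam_V, selected lam_beta)
```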

Numerical study

Monte Carlo simulations

We conducted Monte Carlo simulations to investigate the effectiveness of SPCRsvd. The simulations had six cases, which were the same as those in Kawano et al. (2015) except for Case 6. These six cases are given as follows.

  1. Case 1

    The 10-dimensional covariate vector \({\varvec{x}}=(x_1,\ldots ,x_{10})\) follows a multivariate normal distribution having a zero mean vector and variance-covariance matrix \(\varSigma \). The response was obtained by

    $$\begin{aligned} {y_i = \zeta _1 {\varvec{e}}_1^\top {\varvec{x}}_i + \zeta _2 {\varvec{e}}_2^\top {\varvec{x}}_i + \varepsilon _i}, \quad i=1,\ldots ,n, \end{aligned}$$

    where \({\varvec{e}}_1 = (1,\underbrace{0,\ldots ,0}_{9})^\top \), \({\varvec{e}}_2 = (0,1,\underbrace{0,\ldots ,0}_{8})^\top \), and the \(\varepsilon _i\) are independently distributed as a normal distribution with mean zero and variance \(\sigma ^2\). We used \(\zeta _1=2, \zeta _2=1, \varSigma =I_{10}\). Note that \({\varvec{e}}_1\) and \({\varvec{e}}_2\) are eigenvectors of \(\varSigma \).

  2. Case 2

    This case is the same as Case 1 except with \(\zeta _1=8, \zeta _2=1, \varSigma =\mathrm{diag} (1,3^2,\underbrace{1,\ldots ,1}_{8})\). Then \({\varvec{e}}_2\) becomes the first eigenvector. In addition, \(\mathrm {Cov}(y, x_1)=8\) and \(\mathrm {Cov}(y, x_2)=9\). For more details of this setting, we refer to p. 196 in Kawano et al. (2015).

  3. Case 3

    The 20-dimensional covariate vector \({\varvec{x}}=(x_1,\ldots ,x_{20})\) has multivariate normal distribution \(N_{20}({\varvec{0}}, \varSigma )\). The response was obtained as

    $$\begin{aligned} y_i = 4 {\varvec{\zeta }}^\top {\varvec{x}}_i + \varepsilon _i, \quad i=1,\ldots ,n, \end{aligned}$$

    where \(\varepsilon _i\) are independently distributed as \(N(0,\sigma ^2)\). We used \({\varvec{\zeta }}=({\varvec{\nu }}, \underbrace{0,\ldots ,0}_{11})^\top \) and \(\varSigma =\mathrm{block \ diag}(\varSigma _1, I_{11})\), where \({\varvec{\nu }}=(-1,0,1,1,0,-1,-1,0,1)\) and \(\left( \varSigma _1 \right) _{ij}=0.9^{|i-j|} \ (i,j=1,\ldots ,9)\). Note that \({\varvec{\nu }}\) is a sparse approximation of the fourth eigenvector of \(\varSigma _1\). This case deals with the situation where the response is associated with the fourth principal component.

  4. Case 4

    The 30-dimensional covariate vector \({\varvec{x}}=(x_1,\ldots ,x_{30})\) has multivariate normal distribution \(N_{30}({\varvec{0}}, \varSigma )\). The response was obtained as

    $$\begin{aligned} y_i = 4 {\varvec{\zeta }}_1^\top {\varvec{x}}_i + 4 {\varvec{\zeta }}_2^\top {\varvec{x}}_i + \varepsilon _i, \quad i=1,\ldots ,n, \end{aligned}$$

    where \(\varepsilon _i\) are independently distributed as \(N(0,\sigma ^2)\). We used \({\varvec{\zeta }}_1=({\varvec{\nu }}_1, \underbrace{0,\ldots ,0}_{21})^\top , \) \({\varvec{\zeta }}_2=(\underbrace{0,\ldots ,0}_{9},{\varvec{\nu }}_2, \underbrace{0,\ldots ,0}_{15})^\top , \varSigma =\mathrm{block \ diag}(\varSigma _1, \varSigma _2, I_{15})\). Here \({\varvec{\nu }}_1=(-1,0, 1,1,0,-1,-1,0,1), {\varvec{\nu }}_2=(\underbrace{1,\ldots ,1}_{6})\), and \(\left( \varSigma _2 \right) _{ij}=0.9^{|i-j|} \ (i,j=1,\ldots ,6)\). Note that \({\varvec{\nu }}_1\) is a sparse approximation of the third eigenvector of \(\varSigma _1\) and \({\varvec{\nu }}_2\) is the first eigenvector of \(\varSigma _2\). This case deals with the situation where the response is associated with the third principal component from \(\varSigma _1\) and the first principal component from \(\varSigma _2\).

  5. Case 5

    This case is the same as Case 4 except with \(\varvec{\nu }_2=(1,0,-1,-1,0,1)\). Note that \({\varvec{\nu }}_2\) is a sparse approximation of the third eigenvector of \(\varSigma _2\). This case deals with the situation where the response is associated with the third principal components from \(\varSigma _1\) and \(\varSigma _2\).

  6. Case 6

    This case is the same as Case 2 except with \(\varvec{x} =(x_1,\ldots ,x_{100})\). This is a high-dimensional case of Case 2.
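For illustration, data from Case 1 can be generated as follows; the function name and random seed are arbitrary choices.

```python
import numpy as np

def generate_case1(n, sigma, seed=0):
    # Case 1: x ~ N_10(0, I_10) and y = 2 e1^T x + e2^T x + eps,
    # with eps ~ N(0, sigma^2).
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 10))
    y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=sigma, size=n)
    return y, X
```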

The sample size was set to \(n=50, 200\), and the standard deviation to \(\sigma =1,2\). We considered the two algorithms given in Sect. 4.1: ADMM for SPCRsvd (SPCRsvd-ADMM) and LADMM for SPCRsvd (SPCRsvd-LADMM). SPCRsvd was fitted to the simulated data with one or five components \((k=1,5)\), except for Case 6, where one or two components \((k=1,2)\) were used. We set the value of the tuning parameter w to 0.1 and employed five-fold cross-validation to select the regularization parameters \(\lambda _V\) and \(\lambda _{\varvec{\beta }}\), evaluating the CV criterion (9) on a two-dimensional grid, as illustrated in Fig. 1. The cross-validation surface in Fig. 1 was obtained by SPCRsvd-ADMM with \(k=1\) on data generated from Case 1 with \(n=50\) and \(\sigma =1\); the minimum is achieved at the combination of the first candidate of \(\lambda _V\) and the seventh candidate of \(\lambda _{\varvec{\beta }}\).

Fig. 1
figure1

Cross-validation surface in SPCRsvd-ADMM estimated by data generated from Case 1

SPCRsvd was compared with SPCR, PCR, SPLS, and PLS. SPCR was computed by the package spcr, SPLS by spls, and PLS and PCR by pls. These packages are available in the software R (R Core Team 2020). We used the default settings of the packages when determining the values of the tuning parameters in SPCR, PCR, SPLS, and PLS. The values of the tuning parameters w and \(\xi \) in SPCR were set to 0.1 and 0.01, respectively, and the regularization parameters were then selected by five-fold cross-validation. The value of the regularization parameter in SPLS was selected by 10-fold cross-validation. The number of components in SPLS, PLS, and PCR was also selected by 10-fold cross-validation, from one to five when SPCRsvd-ADMM, SPCRsvd-LADMM, and SPCR employed \(k=5\), and from one to two when they employed \(k=2\). The performance was evaluated in terms of \(\mathrm{MSE}=E[ (y-{\hat{y}})^2 ]\). The simulation was conducted 100 times, and the MSE was estimated from 1,000 random samples.

We summarize the means and standard deviations of the MSEs in Tables 1, 2, 3, 4, 5 and 6. The results for \(\sigma =1,2\) showed similar tendencies. PCR and PLS were worst in almost all cases, so we focus on comparing the other methods. SPCRsvd-LADMM and SPCRsvd-ADMM were competitive with SPCR. In particular, SPCRsvd-LADMM and SPCRsvd-ADMM provided smaller MSEs than SPCR in almost all cases when \(k=1\). Compared to SPLS, SPCRsvd-LADMM and SPCRsvd-ADMM were slightly inferior in many cases when \(k=5\). However, SPLS produced very large MSEs in many cases when \(k=1\).

Table 1 Mean (standard deviation) of MSE for Case 1
Table 2 Mean (standard deviation) of MSE for Case 2
Table 3 Mean (standard deviation) of MSE for Case 3
Table 4 Mean (standard deviation) of MSE for Case 4
Table 5 Mean (standard deviation) of MSE for Case 5
Table 6 Mean (standard deviation) of MSE for Case 6
Table 7 Mean (standard deviation) of TPR, TNR, and MCC for Case 1

The true positive rate (TPR), the true negative rate (TNR), and the Matthews correlation coefficient (MCC) (Matthews 1975) were also computed for SPCRsvd-LADMM, SPCRsvd-ADMM, SPCR, and SPLS. TPR and TNR are respectively defined by

$$\begin{aligned} \mathrm {TPR}&= \frac{\mathrm {TP}}{ \left| \left\{ j:\zeta ^{*}_{j}\ne 0 \right\} \right| } =\frac{1}{100} \sum _{k=1}^{100} \frac{\left| \left\{ j:{\hat{\zeta }}^{(k)}_{j}\ne 0 \wedge \zeta ^{*}_{j}\ne 0 \right\} \right| }{\left| \left\{ j:\zeta ^{*}_{j}\ne 0 \right\} \right| }, \\\mathrm {TNR}&=\frac{\mathrm {TN}}{ \left| \left\{ j:\zeta ^{*}_{j} = 0 \right\} \right| } = \frac{1}{100} \sum _{k=1}^{100} \frac{\left| \left\{ j:{\hat{\zeta }}^{(k)}_{j}=0 \wedge \zeta ^{*}_{j}=0 \right\} \right| }{\left| \left\{ j:\zeta ^{*}_{j}= 0 \right\} \right| }, \end{aligned}$$

where \(\mathrm {TP}=\sum _{k=1}^{100}| \{ j:{\hat{\zeta }}^{(k)}_{j}\ne 0 \wedge \zeta ^{*}_{j}\ne 0 \}|/100\), \(\mathrm {TN}=\sum _{k=1}^{100}| \{ j:{\hat{\zeta }}^{(k)}_{j}=0 \wedge \zeta ^{*}_{j}= 0 \}|/100\), \({\zeta }^{*}_{j}\) is the true j-th coefficient, \({\hat{\zeta }}^{(k)}_{j}\) is the estimated j-th coefficient for the k-th simulation, and \(|\{*\}|\) is the number of elements included in set \(\{*\}\). MCC is defined by

$$\begin{aligned} \text {MCC} = \frac{ \text {TP} \times \text {TN} - \text {FP} \times \text {FN} }{ \sqrt{ (\text {TP}+\text {FP}) (\text {TP}+\text {FN}) (\text {TN}+\text {FP}) (\text {TN}+\text {FN}) } }, \end{aligned}$$

where \(\mathrm {FP}=\sum _{k=1}^{100}| \{ j:{\hat{\zeta }}^{(k)}_{j}\ne 0 \wedge \zeta ^{*}_{j}= 0 \}|/100\) and \(\mathrm {FN}=\sum _{k=1}^{100}| \{ j:{\hat{\zeta }}^{(k)}_{j}=0 \wedge \zeta ^{*}_{j}\ne 0 \}|/100\).
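For concreteness, these support-recovery measures for a single simulation run can be computed as in the following sketch (the coefficient vectors are hypothetical, not taken from the paper's simulations):

```python
import numpy as np

def support_metrics(beta_hat, beta_true):
    """Compute TPR, TNR, and MCC from estimated and true coefficient vectors."""
    est = beta_hat != 0   # estimated support
    tru = beta_true != 0  # true support
    tp = np.sum(est & tru)
    tn = np.sum(~est & ~tru)
    fp = np.sum(est & ~tru)
    fn = np.sum(~est & tru)
    tpr = tp / max(tp + fn, 1)
    tnr = tn / max(tn + fp, 1)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return tpr, tnr, mcc

# Hypothetical example: 3 of 4 true nonzeros recovered, 1 false positive
beta_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0, 0.0, 0.0, 0.0])
beta_hat  = np.array([0.9, -1.8, 0.0, 2.7, 0.2, 0.0, 0.0, 0.0])
tpr, tnr, mcc = support_metrics(beta_hat, beta_true)
# TPR = TNR = 0.75, MCC = 0.5 for this example
```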

Table 7 presents the means and standard deviations of TPR, TNR, and MCC for Case 1. Most methods provided high TPR, whereas SPCR sometimes did not. SPLS provided the highest TNR and MCC among the methods in all situations of Case 1. These tendencies for TPR and TNR were essentially unchanged across all cases, while SPCRsvd-ADMM sometimes provided the highest MCC. The results for Cases 2 to 6 are given in the supplementary material.

We also investigated the sensitivity to the tuning parameter w and the penalty parameters. Table 8 shows MSEs for SPCRsvd with \(w=1,0.5,0.01\). Note that we could not compute MSEs for \(w=0.01\) in Case 6. Table 9 shows MSEs for SPCRsvd with \(\rho =1.5, 0.5\), where \(\rho =\rho _1=\rho _2\) for SPCRsvd-LADMM and \(\rho =\rho _1=\rho _2=\rho _3\) for SPCRsvd-ADMM. In both settings, the simulation was repeated 10 times with \(n=50\), \(\sigma =1\), and \(k=1\). From the results, we observe that varying w has little influence on MSEs in SPCRsvd-ADMM and only a small influence in SPCRsvd-LADMM. For the penalty parameters, varying \(\rho \) has a small influence on MSEs (Case 6 in particular seems to be affected by \(\rho \)). However, these influences do not essentially change the conclusions derived from Tables 1, 2, 3, 4, 5 and 6 in almost all cases. This suggests that the MSEs of SPCRsvd are relatively insensitive to w and \(\rho \).

Table 8 Mean (standard deviation) of MSE for \(w=1,0.5,0.01\)
Table 9 Mean (standard deviation) of MSE for \(\rho =1.5,0.5\), where we set \(\rho =\rho _1=\rho _2\) for SPCRsvd-LADMM and \(\rho =\rho _1=\rho _2=\rho _3\) for SPCRsvd-ADMM

Real data analyses

Table 10 Sample size and number of covariates in real datasets

We applied SPCRsvd to eight real datasets (housing, communities, concrete, diabetes, parkinsons, triazines, winequality-red, and winequality-white), which are available from the UCI database (http://archive.ics.uci.edu/ml/index.html). The sample sizes and the numbers of covariates are listed in Table 10. If the sample size exceeded 1,100, we randomly extracted 1,100 observations from the dataset. For each dataset, we randomly selected 100 observations as training data and used the remainder as test data to estimate MSEs. We standardized the covariates for each dataset. We applied both algorithms, SPCRsvd-LADMM and SPCRsvd-ADMM. The procedure was repeated 50 times.

Table 11 Mean (standard deviation) of MSE for real datasets

We compared SPCRsvd with the four methods used in Sect. 5.1. The number of components was set to \(k=1\). The tuning parameter w in SPCRsvd was set to 0.01, and \(\lambda _V\) and \(\lambda _{\varvec{\beta }}\) were then selected by five-fold cross-validation. The tuning parameters in the other methods were selected in a similar manner to Sect. 5.1.

Table 11 lists the means and standard deviations of MSEs. PLS and PCR were competitive with each other but did not provide the smallest MSE for any dataset. SPCR was slightly better than PLS and PCR. SPCRsvd-LADMM and SPCRsvd-ADMM provided smaller MSEs than the other methods in many cases. Although SPLS sometimes provided smaller MSEs than the other methods, it also produced the worst MSEs in some cases. From these results, we may conclude that SPCRsvd-LADMM and SPCRsvd-ADMM are superior to the other methods in terms of giving smaller MSEs, which is consistent with the results in Sect. 5.1.

Conclusions

In this paper, we proposed SPCRsvd, a one-stage procedure for PCR that combines a regression loss with a PCA loss derived from the SVD. To obtain the parameter estimates in SPCRsvd, we developed two computational algorithms based on ADMM and LADMM. Our numerical studies showed that the one-stage method is competitive with or better than competing approaches.

A major limitation of SPCRsvd is its computational cost. Figure 2 shows the common logarithm of the run-times for the simulation presented in Sect. 5.1, where the simulation was repeated 10 times with \(n=50\), \(\sigma =1\), and \(k=1\). We observe that SPCRsvd-ADMM was faster than SPCRsvd-LADMM, and that the SPCRsvd-based methods required more computation time than the other four methods in almost all cases. This high computational cost causes some problems. For example, SPCRsvd provides relatively low TNR, as seen in Table 7. To address this issue, one could apply the adaptive lasso to the regularization term in SPCRsvd. However, owing to the computational cost, it may be difficult to run SPCRsvd with the adaptive lasso, because the adaptive lasso generally requires more computation time than the lasso.

Fig. 2

Common logarithm of run-times (seconds) for the simulation in Sect. 5.1

SPCRsvd cannot handle binary explanatory variables. To perform PCA for binary data, Lee et al. (2010) introduced logistic PCA with sparse regularization. It would be interesting to extend SPCRsvd in the direction of the method in Lee et al. (2010). We leave this as future research.

References

  1. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122

  2. Bresler G, Park SM, Persu M (2018) Sparse PCA from sparse linear regression. In: Advances in Neural Information Processing Systems, pp 10942–10952

  3. Chang X, Yang H (2012) Combining two-parameter and principal component regression estimators. Stat Pap 53(3):549–562

  4. Chen S, Ma S, Xue L, Zou H (2020) An alternating manifold proximal gradient method for sparse principal component analysis and sparse canonical correlation analysis. INFORMS J Optim 2(3):192–208

  5. Choi J, Zou H, Oehlert G (2010) A penalized maximum likelihood approach to sparse factor analysis. Stat Interface 3(4):429–436

  6. Chun H, Keleş S (2010) Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J R Stat Soc Ser B 72(1):3–25

  7. Danaher P, Wang P, Witten DM (2014) The joint graphical lasso for inverse covariance estimation across multiple classes. J R Stat Soc Ser B 76(2):373–397

  8. d’Aspremont A, El Ghaoui L, Jordan MI, Lanckriet GR (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Rev 49(3):434–448

  9. Dicker LH, Foster DP, Hsu D (2017) Kernel ridge versus principal component regression: minimax bounds and the qualification of regularization operators. Electron J Stat 11(1):1022–1047

  10. Donoho DL, Johnstone JM (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3):425–455

  11. Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499

  12. Erichson NB, Zheng P, Manohar K, Brunton SL, Kutz JN, Aravkin AY (2020) Sparse principal component analysis via variable projection. SIAM J Appl Math 80(2):977–1002

  13. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

  14. Febrero-Bande M, Galeano P, González-Manteiga W (2017) Functional principal component regression and functional partial least-squares regression: an overview and a comparative study. Int Stat Rev 85(1):61–83

  15. Frank LE, Friedman JH (1993) A statistical view of some chemometrics regression tools. Technometrics 35(2):109–135

  16. Friedman J, Hastie T, Höfling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):302–332

  17. Hartnett M, Lightbody G, Irwin G (1998) Dynamic inferential estimation using principal components regression (PCR). Chemom Intell Lab Syst 40(2):215–224

  18. Jennrich RI (2006) Rotation to simple loadings using component loss functions: the oblique case. Psychometrika 71(1):173–191

  19. Jolliffe IT (1982) A note on the use of principal components in regression. Appl Stat 31(3):300–303

  20. Jolliffe IT (2002) Principal component analysis. Wiley Online Library, New York

  21. Kawano S, Fujisawa H, Takada T, Shiroishi T (2015) Sparse principal component regression with adaptive loading. Comput Stat Data Anal 89:192–203

  22. Kawano S, Fujisawa H, Takada T, Shiroishi T (2018) Sparse principal component regression for generalized linear models. Comput Stat Data Anal 124:180–196

  23. Lee S, Huang JZ, Hu J (2010) Sparse logistic principal components analysis for binary data. Ann Appl Stat 4(3):1579–1601

  24. Li X, Mo L, Yuan X, Zhang J (2014) Linearized alternating direction method of multipliers for sparse group and fused lasso models. Comput Stat Data Anal 79:203–221

  25. Ma S, Huang J (2017) A concave pairwise fusion approach to subgroup analysis. J Am Stat Assoc 112(517):410–423

  26. Massy WF (1965) Principal components regression in exploratory statistical research. J Am Stat Assoc 60(309):234–256

  27. Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta Protein Struct 405(2):442–451

  28. Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2:559–572

  29. Price BS, Geyer CJ, Rothman AJ (2019) Automatic response category combination in multinomial logistic regression. J Comput Graph Stat 28(3):758–766

  30. R Core Team (2020) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  31. Reiss PT, Ogden RT (2007) Functional principal component regression and functional partial least squares. J Am Stat Assoc 102(479):984–996

  32. Rosipal R, Girolami M, Trejo LJ, Cichocki A (2001) Kernel PCA for feature extraction and de-noising in nonlinear regression. Neural Comput Appl 10(3):231–243

  33. Shen H, Huang JZ (2008) Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal 99(6):1015–1034

  34. Tan K, London P, Mohan K, Lee S, Fazel M, Witten D (2014) Learning graphical models with hubs. J Mach Learn Res 15:3297–3331

  35. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288

  36. Vu VQ, Cho J, Lei J, Rohe K (2013) Fantope projection and selection: a near-optimal convex relaxation of sparse PCA. In: Advances in Neural Information Processing Systems, pp 2670–2678

  37. Wang B, Zhang Y, Sun WW, Fang Y (2018) Sparse convex clustering. J Comput Graph Stat 27(2):393–403

  38. Wang K, Abbott D (2008) A principal components regression approach to multilocus genetic association studies. Genet Epidemiol 32(2):108–118

  39. Wang X, Yuan X (2012) The linearized alternating direction method for Dantzig selector. SIAM J Sci Comput 34(5):A2792–A2811

  40. Witten DM, Tibshirani R, Hastie T (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3):515–534

  41. Wold H (1975) Soft modeling by latent variables: the nonlinear iterative partial least squares approach. In: Perspectives in Probability and Statistics, Papers in Honour of MS Bartlett, pp 520–540

  42. Wu TT, Lange K (2008) Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2(1):224–244

  43. Yan X, Bien J (2020) Rare feature selection in high dimensions. J Am Stat Assoc (accepted), pp 1–30

  44. Ye GB, Xie X (2011) Split Bregman method for large scale fused lasso. Comput Stat Data Anal 55(4):1552–1569

  45. Zhang CH (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942

  46. Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429

  47. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67(2):301–320

  48. Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Stat 15(2):265–286

  49. Zou H, Xue L (2018) A selective overview of sparse principal component analysis. Proc IEEE 106(8):1311–1320


Acknowledgements

The author thanks the reviewers for their helpful comments and constructive suggestions. This work was supported by JSPS KAKENHI Grant Numbers JP19K11854 and JP20H02227, and MEXT KAKENHI Grant Numbers JP16H06429, JP16K21723, and JP16H06430.

Author information


Corresponding author

Correspondence to Shuichi Kawano.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 22 KB)

Appendix


A    Derivation of the updates in the ADMM algorithm

By simple calculations, we can easily obtain the solutions for \(\beta _0, \Lambda _1, \Lambda _2, {\varvec{\lambda }}_3\). Hence we show only the derivations for \(V_1, V, V_0, Z, {\varvec{\beta }}, {\varvec{\beta }}_0\). For simplicity, we omit the iteration index m.

  • Update of \(V_1\).

    $$\begin{aligned} V_1 := \mathop {\mathrm{arg~min}}\limits _{V_1} \left\{ \frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_1 {\varvec{\beta }} \Vert _2^2 + \frac{\rho _2}{2} \left\| V_1 - V_0 + \Lambda _2 \right\| _F^2 \right\} . \end{aligned}$$

    Set \({\varvec{y}}^* = {\varvec{y}} - \beta _0 {\varvec{1}}_n\). The terms on the right-hand side are calculated by

    $$\begin{aligned} \Vert {\varvec{y}}^* - X V_1 {\varvec{\beta }} \Vert _2^2&= {\varvec{y}}^{*\top } {\varvec{y}}^* - 2 \mathrm{tr} ({\varvec{\beta }} {\varvec{y}}^{*\top } X V_1) + {\varvec{\beta }}^\top V_1^\top X^\top X V_1 {\varvec{\beta }}, \\ \Vert V_1 - V_0 + \Lambda _2 \Vert _F^2&= \mathrm{tr} (V_1^\top V_1) - 2 \mathrm{tr} \{ (V_0-\Lambda _2)^\top V_1\} + \mathrm{tr}\{ (V_0-\Lambda _2)^\top (V_0-\Lambda _2) \}. \end{aligned}$$

    Using these, we obtain

    $$\begin{aligned} {{\mathcal {F}}}&:= \ \frac{1}{n} \Vert {\varvec{y}}^* - X V_1 {\varvec{\beta }} \Vert _2^2 + \frac{\rho _2}{2} \left\| V_1 - V_0 + \Lambda _2 \right\| _F^2 \\&= \ \frac{1}{n} {\varvec{\beta }}^\top V_1^\top X^\top X V_1 {\varvec{\beta }} - \frac{2}{n} \mathrm{tr} ({\varvec{\beta }} {\varvec{y}}^{*\top } X V_1) \\&\quad + \frac{\rho _2}{2} \mathrm{tr} (V_1^\top V_1) - \rho _2 \mathrm{tr} \{ (V_0-\Lambda _2)^\top V_1\} + C, \end{aligned}$$

    where C is a constant. Setting \(\partial {{\mathcal {F}}} / \partial V_1 = {\varvec{O}}\), we have

    $$\begin{aligned} \frac{2}{n} X^\top X V_1 {\varvec{\beta }} {\varvec{\beta }}^\top - \frac{2}{n} X^\top {\varvec{y}}^* {\varvec{\beta }}^\top + \rho _2 V_1 - \rho _2 (V_0 - \Lambda _2) = \varvec{O}. \end{aligned}$$

    This leads to the update for \(V_1\).
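The stationarity condition above is a Sylvester-type matrix equation in \(V_1\), which can be solved by vectorization. The following numpy sketch illustrates this with synthetic data (all dimensions and parameter values are placeholders, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 30, 5, 2
X = rng.standard_normal((n, p))
y_star = rng.standard_normal(n)        # plays the role of y - beta_0 * 1_n
beta = rng.standard_normal(k)
V0 = rng.standard_normal((p, k))
Lam2 = rng.standard_normal((p, k))
rho2 = 1.0

# First-order condition: (2/n) X^T X V1 (beta beta^T) + rho2 V1
#                        = (2/n) X^T y* beta^T + rho2 (V0 - Lambda_2)
A = X.T @ X
M = np.outer(beta, beta)
rhs = (2.0 / n) * np.outer(X.T @ y_star, beta) + rho2 * (V0 - Lam2)

# Vectorize: vec(A V1 M) = (M kron A) vec(V1) for symmetric M (column-major vec)
K = (2.0 / n) * np.kron(M, A) + rho2 * np.eye(p * k)
V1 = np.linalg.solve(K, rhs.flatten(order="F")).reshape((p, k), order="F")

# The stationarity condition should hold at the solution
residual = (2.0 / n) * A @ V1 @ M + rho2 * V1 - rhs
```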

  • Update of V.

    $$\begin{aligned} V := \mathop {\mathrm{arg~min}}\limits _{V} \left\{ \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \frac{\rho _1}{2} \Vert V - V_0 + \Lambda _1 \Vert _F^2 \right\} \ \ \mathrm{subject \ to} \ \ V^\top V = I_k. \end{aligned}$$

    The terms on the right-hand side are calculated by

    $$\begin{aligned} \Vert X - Z V^\top \Vert _F^2&= \mathrm{tr} (X^\top X) - 2 \mathrm{tr} (V Z^\top X) + \mathrm{tr}(Z^\top Z), \\ \Vert V - V_0 + \Lambda _1 \Vert _F^2&= -2 \mathrm{tr} \{ (V_0-\Lambda _1)^\top V\} + \mathrm{tr}\{ (V_0-\Lambda _1)^\top (V_0-\Lambda _1) \} + k. \end{aligned}$$

    Incorporating the equality constraint \(V^\top V = I_k\), we obtain

    $$\begin{aligned}&\mathop {\mathrm{arg~min}}\limits _{V} \left\{ \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 + \frac{\rho _1}{2} \Vert V - V_0 + \Lambda _1 \Vert _F^2 \right\} \\&\quad = \mathop {\mathrm{arg~min}}\limits _{V} \left\{ \left\| V - \left\{ \frac{w}{n} X^\top Z + \frac{\rho _1}{2} (V_0 - \Lambda _1) \right\} \right\| _F^2 \right\} . \end{aligned}$$

    From the SVD \({w} X^\top Z/n + {\rho _1}\left( V_0 - \Lambda _1 \right) /2 = P \varOmega Q^\top \), we obtain the solution \(V = P Q^\top \). This follows from the Procrustes rotation by Zou et al. (2006).
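This Procrustes step can be sketched in numpy as follows, where G stands in for the matrix \(w X^\top Z/n + \rho _1 (V_0 - \Lambda _1)/2\) (the data are synthetic):

```python
import numpy as np

def procrustes_update(G):
    """Solve min_V ||V - G||_F^2 s.t. V^T V = I_k via the SVD G = P Omega Q^T."""
    P, _, Qt = np.linalg.svd(G, full_matrices=False)
    return P @ Qt

rng = np.random.default_rng(0)
p, k = 6, 2
G = rng.standard_normal((p, k))  # stands in for (w/n) X^T Z + (rho_1/2)(V_0 - Lambda_1)
V = procrustes_update(G)
```

Since \(\Vert V - G \Vert _F^2 = \Vert G \Vert _F^2 + k - 2\,\mathrm{tr}(V^\top G)\), the minimizer maximizes \(\mathrm{tr}(V^\top G)\), whose maximum over orthonormal V equals the sum of the singular values of G.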

  • Update of \(V_0\).

    $$\begin{aligned} V_0\! :=\! \mathop {\mathrm{arg~min}}\limits _{V_0} \left\{ \frac{\rho _1}{2} \Vert V - V_0 + \Lambda _1 \Vert _F^2 \!+\! \frac{\rho _2}{2} \Vert V_1\! -\! V_0 + \Lambda _2 \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 \right\} .\nonumber \\ \end{aligned}$$
    (A.1)

    A simple calculation shows that the first two terms on the right-hand side are given by

    $$\begin{aligned} \frac{\rho _1+\rho _2}{2} \left\| V_0 - \frac{1}{\rho _1+\rho _2} \{ \rho _1 (V+\Lambda _1) + \rho _2 (V_1+\Lambda _2) \} \right\| _F^2. \end{aligned}$$

    Formula (A.1) can be rewritten as

    $$\begin{aligned} V_0:= & {} \mathop {\mathrm{arg~min}}\limits _{V_0} \left\{ \frac{1}{2} \left\| V_0 - \frac{1}{\rho _1+\rho _2} \left\{ \rho _1 (V+\Lambda _1) + \rho _2 (V_1+\Lambda _2) \right\} \right\| _F^2\right. \\&\left. + \frac{\lambda _V}{\rho _1+\rho _2} \Vert V_0 \Vert _1 \right\} . \end{aligned}$$

    Thus we obtain the update of \(V_0\).
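The resulting update is the elementwise soft-thresholding (proximal) operator applied to the weighted average above. A numpy sketch with synthetic inputs (all values are placeholders):

```python
import numpy as np

def soft_threshold(A, t):
    """Elementwise soft-thresholding S(a, t) = sign(a) * max(|a| - t, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

rng = np.random.default_rng(0)
p, k = 5, 2
rho1, rho2, lam_V = 1.0, 1.0, 0.3
V, Lam1 = rng.standard_normal((p, k)), rng.standard_normal((p, k))
V1, Lam2 = rng.standard_normal((p, k)), rng.standard_normal((p, k))

# Weighted average target of the quadratic part, then prox of the l1 term
target = (rho1 * (V + Lam1) + rho2 * (V1 + Lam2)) / (rho1 + rho2)
V0 = soft_threshold(target, lam_V / (rho1 + rho2))
```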

  • Update of Z.

    $$\begin{aligned} Z := \mathop {\mathrm{arg~min}}\limits _{Z} \left\{ \frac{w}{n} \Vert X - Z V^\top \Vert _F^2 \right\} . \end{aligned}$$

    We have the solution \(Z=X V\) from the first-order optimality condition.

  • Update of \(\varvec{\beta }\).

    $$\begin{aligned} {\varvec{\beta }} := \mathop {\mathrm{arg~min}}\limits _{\varvec{\beta }} \left\{ \frac{1}{n} \Vert {\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_1 {\varvec{\beta }} \Vert _2^2 + \frac{\rho _2}{2} \Vert {\varvec{\beta }} - {\varvec{\beta }}_0 + {\varvec{\lambda }} \Vert _2^2 \right\} . \end{aligned}$$

    The first-order optimality condition is

    $$\begin{aligned} - \frac{2}{n} V_1^\top X^\top ({\varvec{y}} - \beta _0 {\varvec{1}}_n - X V_1 {\varvec{\beta }}) + \rho _2 ({\varvec{\beta }} - {\varvec{\beta }}_0 + {\varvec{\lambda }}) = {\varvec{0}}. \end{aligned}$$

    This leads to the update of \(\varvec{\beta }\).
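In closed form, the condition gives \({\varvec{\beta }} = \{ (2/n) V_1^\top X^\top X V_1 + \rho _2 I_k \}^{-1} \{ (2/n) V_1^\top X^\top {\varvec{y}}^* + \rho _2 ({\varvec{\beta }}_0 - {\varvec{\lambda }}) \}\). A numpy check with synthetic data (placeholder values, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 6, 2
X = rng.standard_normal((n, p))
y_star = rng.standard_normal(n)        # y - beta_0 * 1_n
V1 = rng.standard_normal((p, k))
beta0_vec = rng.standard_normal(k)     # the split variable beta_0 (a vector)
lam = rng.standard_normal(k)
rho2 = 1.0

# Closed-form solution of the first-order optimality condition
A = (2.0 / n) * V1.T @ X.T @ X @ V1 + rho2 * np.eye(k)
b = (2.0 / n) * V1.T @ X.T @ y_star + rho2 * (beta0_vec - lam)
beta = np.linalg.solve(A, b)

# Verify that the optimality condition is satisfied
grad = (-2.0 / n) * V1.T @ X.T @ (y_star - X @ V1 @ beta) + rho2 * (beta - beta0_vec + lam)
```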

  • Update of \(\varvec{\beta }_0\).

    $$\begin{aligned} {\varvec{\beta }}_0 := \mathop {\mathrm{arg~min}}\limits _{{\varvec{\beta }}_0} \left\{ \frac{\rho _2}{2} \Vert {\varvec{\beta }} - {\varvec{\beta }}_0 + {\varvec{\lambda }} \Vert _2^2 + \lambda _\beta \Vert {\varvec{\beta }}_0 \Vert _1\right\} . \end{aligned}$$

    It is clear that the update of \({\varvec{\beta }}_0\) can be simply obtained by an element-wise soft-threshold operator.

B    LADMM algorithm for SPCRsvd

The LADMM algorithm for SPCRsvd is as follows:

  • Step 1 Set the values of the tuning parameter w, regularization parameters \(\lambda _V, \lambda _{\varvec{\beta }}\), and penalty parameters \(\rho _1, \rho _2\).

  • Step 2 Initialize all the parameters as \(\beta _0^{(0)}, {\varvec{\beta }}^{(0)}, {\varvec{\beta }}_0^{(0)}, Z^{(0)}, V^{(0)}, V_0^{(0)}, \Lambda ^{(0)},{\varvec{\lambda }}^{(0)}\).

  • Step 3 For \(m=0,1,2,\ldots \), repeat Steps 4 to 10 until convergence.

  • Step 4 Update V as follows:

    $$\begin{aligned} V^{(m+1)}=PQ^\top , \end{aligned}$$

    where P and Q are the matrices given by SVD

    $$\begin{aligned} \frac{w}{n} X^\top Z^{(m)} + \frac{\rho _1}{2} \left( V_0^{(m)} + \Lambda ^{(m)} \right) = P \varOmega Q^\top . \end{aligned}$$
  • Step 5 Update \(V_0\) as follows:

    $$\begin{aligned} v_{0ij}^{(m+1)} = {{\mathcal {S}}} \left( s_{ij}, \lambda _V/\left( \frac{2 \nu + n \rho _1}{n} \right) \right) , \quad i=1,\ldots ,p, \ j=1,\ldots ,k, \end{aligned}$$
    (B.1)

    where \(v_{0ij}^{(m)}=(V_0^{(m)})_{ij}\), \(\nu \) is the maximum eigenvalue of \({\varvec{\beta }}^{(m)} {\varvec{\beta }}^{(m)\top } \otimes X^\top X\), and \(s_{ij}\) is the (ij)-th element of the matrix

    $$\begin{aligned}&\frac{2n}{2 \nu + n \rho _1} \bigg \{ \frac{1}{n} \bigg ( X^\top ({\varvec{y}} - \beta _0^{(m)} {\varvec{1}}_n) {\varvec{\beta }}^{(m)\top } - X^\top X V_0^{(m)} {\varvec{\beta }}^{(m)} {\varvec{\beta }}^{(m)\top } \bigg ) \\&\quad + \frac{\nu }{n} V_0^{(m)} - \frac{\rho _1}{2} \bigg ( \Lambda ^{(m)} - V^{(m+1)} \bigg ) \bigg \}. \end{aligned}$$
  • Step 6 Update Z by \(Z^{(m+1)}=X V^{(m+1)}\).

  • Step 7 Update \({\varvec{\beta }}\) as follows:

    $$\begin{aligned} \varvec{\beta }^{(m+1)}&= \left( \frac{1}{n} V_0^{(m+1)\top } X^\top X V_0^{(m+1)} + \frac{\rho _2}{2} I_k \right) ^{-1} \bigg \{ \frac{1}{n} V_0^{(m+1)\top } X^\top ({\varvec{y}} - \beta _0^{(m)} {\varvec{1}}_n) \\&\quad + \frac{\rho _2}{2} ({\varvec{\beta }}_0^{(m)} - {\varvec{\lambda }}^{(m)}) \bigg \}. \end{aligned}$$
  • Step 8 Update \({\varvec{\beta }}_0\) as follows:

    $$\begin{aligned} \beta _{0j}^{(m+1)} = {{\mathcal {S}}} \left( \beta _j^{(m+1)} + \lambda _{j}^{(m)}, \frac{\lambda _\beta }{\rho _2} \right) , \quad j=1,\ldots ,k, \end{aligned}$$

    where \(\lambda _{j}^{(m)}\) and \(\beta _j^{(m)}\) are the j-th element of the vector \({\varvec{\lambda }}^{(m)}\) and \({\varvec{\beta }}^{(m)}\), respectively.

  • Step 9 Update \(\beta _0\) as follows:

    $$\begin{aligned} \beta _{0}^{(m+1)} = \frac{1}{n} {\varvec{1}}_n ^\top ( {\varvec{y}} - X V_0^{(m+1)} {\varvec{\beta }}^{(m+1)} ). \end{aligned}$$
  • Step 10 Update \({\Lambda },{\varvec{\lambda }}\) as follows:

    $$\begin{aligned} \Lambda ^{(m+1)}&= \Lambda ^{(m)} + V_0^{(m+1)} - V^{(m+1)},\\ {\varvec{\lambda }}^{(m+1)}&= {\varvec{\lambda }}^{(m)} + {\varvec{\beta }}^{(m+1)} - {\varvec{\beta }}^{(m+1)}_0. \end{aligned}$$
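As a compact illustration, Steps 4 to 10 can be sketched in numpy as one LADMM sweep. This is a sketch of the updates as written above, with synthetic data and placeholder tuning-parameter values, not the author's implementation:

```python
import numpy as np

def soft(a, t):
    """Elementwise soft-thresholding: S(a, t) = sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def ladmm_iteration(X, y, state, w=0.1, lam_V=0.1, lam_beta=0.1, rho1=1.0, rho2=1.0):
    """One sweep of Steps 4-10 for SPCRsvd (a sketch of the stated updates)."""
    beta0, beta, beta0_vec, Z, V, V0, Lam, lam_vec = state
    n, p = X.shape
    k = V.shape[1]
    ones = np.ones(n)
    # Step 4: V-update via orthogonal Procrustes
    P, _, Qt = np.linalg.svd(w / n * X.T @ Z + rho1 / 2 * (V0 + Lam),
                             full_matrices=False)
    V = P @ Qt
    # Step 5: linearized V0-update by elementwise soft-thresholding,
    # with nu the largest eigenvalue of (beta beta^T) kron (X^T X)
    nu = (beta @ beta) * np.linalg.eigvalsh(X.T @ X).max()
    S = (2 * n / (2 * nu + n * rho1)) * (
        (np.outer(X.T @ (y - beta0 * ones), beta)
         - X.T @ X @ V0 @ np.outer(beta, beta)) / n
        + nu / n * V0 - rho1 / 2 * (Lam - V))
    V0 = soft(S, n * lam_V / (2 * nu + n * rho1))
    # Step 6: Z-update
    Z = X @ V
    # Step 7: closed-form beta-update
    A = V0.T @ X.T @ X @ V0 / n + rho2 / 2 * np.eye(k)
    b = V0.T @ X.T @ (y - beta0 * ones) / n + rho2 / 2 * (beta0_vec - lam_vec)
    beta = np.linalg.solve(A, b)
    # Step 8: sparse beta_0-update by soft-thresholding
    beta0_vec = soft(beta + lam_vec, lam_beta / rho2)
    # Step 9: intercept update
    beta0 = (ones @ (y - X @ V0 @ beta)) / n
    # Step 10: dual updates
    Lam = Lam + V0 - V
    lam_vec = lam_vec + beta - beta0_vec
    return beta0, beta, beta0_vec, Z, V, V0, Lam, lam_vec

# Synthetic run (shapes only; data and initial values are placeholders)
rng = np.random.default_rng(0)
n, p, k = 30, 6, 2
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
state = (0.0, rng.standard_normal(k), np.zeros(k), rng.standard_normal((n, k)),
         np.linalg.qr(rng.standard_normal((p, k)))[0],
         rng.standard_normal((p, k)), np.zeros((p, k)), np.zeros(k))
state = ladmm_iteration(X, y, state)
```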

Next, we describe only the update of \(V_0\), because the derivations of the other updates are the same as in Appendix A. As in Appendix A, we omit the iteration index m.

We consider

$$\begin{aligned} V_0\! := \!\mathop {\mathrm{arg~min}}\limits _{V_0} \left\{ \frac{1}{n} \Vert {\varvec{y}}\! -\! \beta _0 {\varvec{1}}_n\! -\! X V_0 {\varvec{\beta }} \Vert _2^2 + \frac{\rho _1}{2} \Vert V_0 - V+ \Lambda \Vert _F^2 + \lambda _V \Vert V_0 \Vert _1 \right\} .\nonumber \\ \end{aligned}$$
(B.2)

Set \({\varvec{y}}^* = {\varvec{y}} - \beta _0 {\varvec{1}}_n\). By Taylor expansion, the term \(\Vert {\varvec{y}}^* - X V_0 {\varvec{\beta }} \Vert _2^2\) is approximated as

$$\begin{aligned} \Vert {\varvec{y}}^* - X V_0 {\varvec{\beta }} \Vert _2^2&= {\varvec{y}}^{*\top } {\varvec{y}}^* - 2 \mathrm{tr} ( {\varvec{\beta }} {\varvec{y}}^{*\top } X V_0 ) + {\varvec{\beta }}^\top V_0^\top X^\top X V_0 {\varvec{\beta }} \nonumber \\&\approx {\varvec{y}}^{*\top } {\varvec{y}}^* - 2 \mathrm{tr} ( {\varvec{\beta }} {\varvec{y}}^{*\top } X V_0 ) + 2 \mathrm{tr} ( {\varvec{\beta }} {\varvec{\beta }}^\top {\tilde{V}}_0^\top X^\top X V_0 ) + \nu \Vert V_0 - {\tilde{V}}_0 \Vert _F^2, \end{aligned}$$

where \({\tilde{V}}_0\) is the current estimate of \(V_0\) and \(\nu \) is a constant. Following Li et al. (2014), we use the maximum eigenvalue of \({\varvec{\beta }} {\varvec{\beta }}^\top \otimes X^\top X\) as \(\nu \). Using the approximation, the problem (B.2) can be replaced with

$$\begin{aligned} V_0&:= \mathop {\mathrm{arg~min}}\limits _{V_0} \bigg \{ \underbrace{- \frac{2}{n} \mathrm{tr} ( {\varvec{\beta }} {\varvec{y}}^{*\top } X V_0 ) + \frac{2}{n} \mathrm{tr} ( {\varvec{\beta }} {\varvec{\beta }}^\top {\tilde{V}}_0^\top X^\top X V_0 ) + \frac{\nu }{n} \Vert V_0 - {\tilde{V}}_0 \Vert _F^2 + \frac{\rho _1}{2} \Vert V_0 - V + \Lambda \Vert _F^2}_{\mathrm{(A)}} \\&\quad + \lambda _V \Vert V_0 \Vert _1 \bigg \}. \end{aligned}$$

Formula (A) is calculated as

$$\begin{aligned} \frac{2 \nu + n \rho _1}{2n} \left\| V_0 - \frac{2n}{2 \nu + n \rho _1} \left\{ \frac{1}{n} (X^\top {\varvec{y}}^* {\varvec{\beta }}^\top - X^\top X {\tilde{V}}_0 {\varvec{\beta }} {\varvec{\beta }}^\top ) + \frac{\nu }{n} {\tilde{V}}_0 - \frac{\rho _1}{2} (\Lambda - V) \right\} \right\| _F^2. \end{aligned}$$

This leads to the update of \(V_0\) given in Formula (B.1).
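The constant \(\nu \) can be obtained without forming the Kronecker product explicitly: the eigenvalues of a Kronecker product are the products of the factors' eigenvalues, so the largest eigenvalue of \({\varvec{\beta }} {\varvec{\beta }}^\top \otimes X^\top X\) equals \(\Vert {\varvec{\beta }}\Vert _2^2\) times the largest eigenvalue of \(X^\top X\). A small numpy check with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20, 4, 2
X = rng.standard_normal((n, p))
beta = rng.standard_normal(k)

# Direct computation of nu as the largest eigenvalue of the Kronecker product
nu_direct = np.linalg.eigvalsh(np.kron(np.outer(beta, beta), X.T @ X)).max()

# Cheaper equivalent: eigenvalues of a Kronecker product are products of the
# factors' eigenvalues, and beta beta^T has largest eigenvalue ||beta||_2^2
nu_fast = (beta @ beta) * np.linalg.eigvalsh(X.T @ X).max()
```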

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Kawano, S. Sparse principal component regression via singular value decomposition approach. Adv Data Anal Classif (2021). https://doi.org/10.1007/s11634-020-00435-2


Keywords

  • ADMM
  • Lasso
  • One-stage procedure
  • Singular value decomposition
  • Principal component analysis

Mathematics Subject Classification

  • 62H25
  • 62J07
  • 62J05