Obtaining insights from high-dimensional data: sparse principal covariates regression
Abstract
Background
Data analysis methods are usually subdivided into two distinct classes: methods for prediction and methods for exploration. In practice, however, there is often a need to learn from the data in both ways. For example, when predicting antibody titers a few weeks after vaccination on the basis of genomewide mRNA transcription rates, mechanistic insights into the effect of vaccination on the immune system are also sought. Principal covariates regression (PCovR) is a method that combines both purposes. Yet, it fails to yield insightful representations of the data because these include all the variables.
Results
Here, we propose a sparse extension of principal covariates regression such that the resulting solutions are based on an automatically selected subset of the variables. Our method is shown to outperform competing methods like sparse principal components regression and sparse partial least squares in a simulation study. Furthermore, good performance of the method is illustrated on publicly available data including antibody titers and genomewide transcription rates for subjects vaccinated against the flu: the genes selected by sparse PCovR are highly enriched for immune-related terms, and the method predicts the titers for an independent test sample well. In comparison, no significantly enriched terms were found for the genes selected by sparse partial least squares, and out-of-sample prediction was worse.
Conclusions
Sparse principal covariates regression is a promising and competitive tool for obtaining insights from high-dimensional data.
Availability
The source code implementing our proposed method is available from GitHub, together with all scripts used to extract, pre-process, analyze, and post-process the data: https://github.com/katrijnvandeun/SPCovR.
Keywords
Dimension reduction; Prediction; High-dimensional data; Immunology; Stability selection
Abbreviations
- BACH2
Transcription regulator protein BACH2
- CAMK4
Calcium/calmodulin-dependent kinase IV
- HAI
Hemagglutination inhibition assay
- LARS
Least angle regression
- PCA
Principal component analysis
- PCovR
Principal covariates regression
- PCR
Principal components regression
- PLS
Partial least squares
- PP1R11
Protein phosphatase 1 regulatory inhibitor subunit 11
- PRESS
Predicted residual error sum of squares
- RMA
Robust Multichip Average
- SMUG1
Single-Strand-Selective Monofunctional Uracil-DNA Glycosylase 1
- SPCovR
Sparse principal covariates regression
- SPCR
Sparse principal components regression
- SPLS
Sparse partial least squares
- VAF
Variation accounted for
- VAFX
Proportion of variation accounted for in the covariates
- VAFY
Proportion of variation accounted for in the outcome
Background
Traditionally, data analysis methods are divided into two classes with different goals: methods for prediction (or supervised learning) and methods for exploration (or unsupervised learning). An example of the former is assessing whether someone is at risk for breast cancer; in this case the aim is to use currently available information to predict an unseen (often future) outcome. The goal of exploratory methods, on the other hand, is to gain an understanding of the mechanisms that cause structural variation and covariation in the available information. For example, exploration of gene expression data collected over time after the addition of serum gave insight not only into the transcriptional program but also into processes related to wound repair [1]. There are many cases, however, where it is of interest to reach both objectives and to predict an outcome of interest while simultaneously revealing the processes at play. This is, for example, the case in the study of [2]: the gene expression response soon after vaccination and the antibody titers much later in time were measured with the aim of both predicting immunogenicity and revealing new mechanistic insights about vaccines.
To reveal the underlying mechanisms, component- or matrix-decomposition-based methods can be used. Well-known examples are principal component analysis (PCA) and the singular value decomposition [3]. Another frequent use of such methods is in the context of prediction with many covariates: a popular approach is to first reduce the covariates to a limited number of components and to subsequently use these for prediction. This is known as principal components regression (PCR; see [4]). A drawback of this two-step approach is that the components are constructed without any account of the prediction problem and hence may miss the components that are relevant for predicting the outcome. This is especially true when the number of predictor variables is huge and represents a large diversity of processes, as is for example the case with genomewide expression data. Sparse regression approaches like the lasso [5] and the elastic net [6], on the other hand, focus only on modeling the outcome, with no account of the structural variation underlying the covariates. Hence, approaches are needed that find components that simultaneously reveal the underlying mechanisms and model the outcome of interest. Partial least squares (PLS; see for example [7]) and principal covariates regression (PCovR; [8]) are such methods. Yet, partial least squares may focus too strongly on prediction [9], whereas principal covariates regression can be flexibly tuned to balance between prediction of the outcome and reduction of the covariates to a few components.
Apart from implementation issues (existing PCovR software can only be used on data with a modest number of variables), a shortcoming of PCovR is that the components are linear combinations of all variables. This is undesirable when working with a large set of variables, from both a statistical and an interpretational point of view. First, the estimators are not consistent in the p>n case [10], and second, the interpretation of components based on a high number of variables is infeasible. Furthermore, components that are based on a limited set of selected variables better reflect the fact that many biological processes are governed by only a few genes. To overcome such issues in partial least squares, sparse methods have been developed [10, 11]. Likewise, we propose here a sparse and efficient version of principal covariates regression^{1}. The proposed method offers a flexible and promising alternative to sparse partial least squares.
The paper is organized as follows. First, we present the PCovR method, propose its sparse extension (SPCovR), and discuss its relation to (sparse) PLS. The (comparative) performance of SPCovR is evaluated in a simulation study and in an application to genomewide expression data collected for persons vaccinated against the flu [2]. The implementation of the SPCovR algorithm is available online (https://github.com/katrijnvandeun/SPCovR), together with the scripts used for analyzing the data.
Methods
Sparse principal covariates regression
We will make use of the following notation: matrices are denoted by bold uppercases, the transpose by the superscript ^{ T }, vectors by bold lowercase, and scalars by lowercase italics. Furthermore, we will use the convention to indicate the cardinality of a running index by the capital of the letter used to run the index (e.g., this paper deals with J variables with j running from 1 to J), see [12].
Formal model
\(\mathbf{X} = \mathbf{X}\mathbf{W}\mathbf{P}_{x}^{T} + \mathbf{E}_{x} = \mathbf{T}\mathbf{P}_{x}^{T} + \mathbf{E}_{x}\) (1)
\(\mathbf{Y} = \mathbf{X}\mathbf{W}\mathbf{P}_{y}^{T} + \mathbf{E}_{y} = \mathbf{T}\mathbf{P}_{y}^{T} + \mathbf{E}_{y}\) (2)
with XW=T the I×R matrix of component scores, P_{ x } the J_{ x }×R matrix of variable loadings, P_{ y } the J_{ y }×R matrix of regression weights, and E_{ x }, E_{ y } the residuals. Note that we consider the general problem of a multivariate outcome, hence R regression coefficients for each of the J_{ y } outcome variables. The component scores are often constrained to be orthogonal: T^{ T }T=I with I an identity matrix of size R×R. Despite this restriction there still is rotational freedom and the PCovR coefficients are not uniquely defined. The data model represented in (1) and (2) summarizes the predictor variables by means of a few components and uses these as the predictors of the outcome. Note that the same model also underlies principal components regression and partial least squares.
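The decomposition in (1) and (2) can be sketched numerically. The sizes below are toy values chosen for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J_x, J_y, R = 20, 10, 2, 3           # toy sizes (illustration only)

X = rng.standard_normal((I, J_x))       # covariates
W = rng.standard_normal((J_x, R))       # component weights
T = X @ W                               # I x R component scores, T = XW
P_x = rng.standard_normal((J_x, R))     # variable loadings
P_y = rng.standard_normal((J_y, R))     # regression weights

X_model = T @ P_x.T                     # structural part of X in Eq. (1)
Y_model = T @ P_y.T                     # structural part of Y in Eq. (2)
```

Adding residual matrices E_x and E_y of the matching shapes recovers X and Y exactly.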
Objective function
\(\min_{\mathbf{W},\mathbf{P}_{x},\mathbf{P}_{y}} \; \alpha\frac{\|\mathbf{X}-\mathbf{X}\mathbf{W}\mathbf{P}_{x}^{T}\|^{2}}{\|\mathbf{X}\|^{2}} + (1-\alpha)\frac{\|\mathbf{Y}-\mathbf{X}\mathbf{W}\mathbf{P}_{y}^{T}\|^{2}}{\|\mathbf{Y}\|^{2}}\) (3)
where the weighting parameter α (0≤α≤1) balances the reduction of the covariates against the prediction of the outcome. A solution to (3) based on the singular value decomposition of X was proposed by [13]. An efficient implementation that accounts for large data (either I or J large) can be found in the online code.
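For intuition, a minimal non-sparse PCovR solver can be sketched as follows. This sketch takes the well-known eigendecomposition route (component scores as top eigenvectors of a weighted sum of XX^T and the projected YY^T), not the authors' SVD-based implementation; all function and variable names are ours:

```python
import numpy as np

def pcovr(X, Y, R, alpha):
    """Non-sparse PCovR sketch: T holds the top-R eigenvectors of a weighted
    sum of X X^T and the projected Y Y^T; W, P_x, P_y follow by least squares."""
    H = X @ np.linalg.pinv(X)                    # projection onto col(X)
    G = (alpha * X @ X.T / np.sum(X ** 2)
         + (1 - alpha) * H @ Y @ Y.T @ H / np.sum(Y ** 2))
    evals, evecs = np.linalg.eigh(G)
    T = evecs[:, np.argsort(evals)[::-1][:R]]    # top-R eigenvectors, orthonormal
    P_x, P_y = X.T @ T, Y.T @ T                  # conditional least squares
    W = np.linalg.pinv(X) @ T                    # component weights with T = XW
    return W, P_x, P_y, T
```

Because the eigenvectors are orthonormal, the returned T satisfies the constraint T^T T = I directly.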
under the restrictions. Criterion (6) is approximately equal to \(R_{X}^{2}R_{Y}^{2}\) and can be compared to criterion (4) to obtain some intuition about the similarities and differences between the two methods. Given that the PLS and PCovR criteria differ, the obtained estimates can be expected to differ as well. Whereas PLS cannot be expressed as a special case of PCovR with a particular value of the tuning parameter α, it has been shown to be a special case of continuum regression with the continuum regression parameter set equal to 0.5 [14].
\(\min_{\mathbf{W},\mathbf{P}_{x},\mathbf{P}_{y}} \; \alpha\frac{\|\mathbf{X}-\mathbf{X}\mathbf{W}\mathbf{P}_{x}^{T}\|^{2}}{\|\mathbf{X}\|^{2}} + (1-\alpha)\frac{\|\mathbf{Y}-\mathbf{X}\mathbf{W}\mathbf{P}_{y}^{T}\|^{2}}{\|\mathbf{Y}\|^{2}} + \lambda_{1}|\mathbf{W}|_{1} + \lambda_{2}|\mathbf{W}|_{2}^{2}\) (7)
with \(|\mathbf {W}|_{1}=\sum _{j,r} |w_{jr}|\) the lasso penalty and \(|\mathbf {W}|_{2}^{2}=\sum _{j,r} w_{jr}^{2}\) the ridge penalty. λ_{1} (λ_{1}≥0) and λ_{2} (λ_{2}≥0) are the tuning parameters for the lasso and ridge penalties respectively. The effect of the lasso is that it shrinks the coefficients, setting some (or, for large λ_{1}, many) exactly to zero, thus performing variable selection. Note that the lasso penalty in Eq. (7) is imposed only on the component weights and not on the loadings P_{ x } or on the regression weights P_{ y }. The penalty implies that some or many of the component weights will become zero; because the loadings and regression weights are not subject to the lasso penalty, they are not affected by it. The ridge penalty also introduces shrinkage and is included here for two reasons: to stabilize the estimated coefficients and to allow for more than I non-zero coefficients; this combination of penalties is known as the elastic net. Both the lasso and the elastic net are known to over-shrink the non-zero coefficients [15, 16]. One way to undo this shrinkage is to re-estimate the non-zero coefficients using an ordinary least squares approach [17]. When α=1, the objective function (7) reduces to the sparse PCA criterion [18] and the resulting estimates can also be obtained under a sparse principal components regression approach. When α=0 and R=1, the elastic net regression formulation is obtained [6] and the two problems are equivalent. Note that the introduction of the sparseness restriction eliminates the rotational freedom and, under suitable conditions, yields a unique solution.
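As a concrete reading of criterion (7), the penalized loss can be written down directly. This is a sketch for evaluation only (the function name is ours); the estimation itself is discussed in the Algorithm section:

```python
import numpy as np

def spcovr_loss(X, Y, W, P_x, P_y, alpha, lam1, lam2):
    """Weighted least-squares loss of Eq. (7): alpha times the normalized
    misfit of X plus (1 - alpha) times the normalized misfit of Y, with the
    elastic-net penalty imposed on the component weights W only."""
    T = X @ W
    loss_x = np.sum((X - T @ P_x.T) ** 2) / np.sum(X ** 2)
    loss_y = np.sum((Y - T @ P_y.T) ** 2) / np.sum(Y ** 2)
    penalty = lam1 * np.abs(W).sum() + lam2 * (W ** 2).sum()
    return alpha * loss_x + (1 - alpha) * loss_y + penalty
```

With W = 0 and no penalties, both normalized misfits equal one, so the loss is exactly 1 regardless of α.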
This sparse PLS criterion differs from the SPCovR criterion (7) over the whole range of α; the two methods can therefore be expected to yield different estimates.
Algorithm
The procedure that we use to estimate the model parameters estimates all R components simultaneously, and not one by one as is often the case in the literature. The main benefit is that this gives control over the constraints that are imposed on the parameter estimates. More specifically, we offer the choice to constrain the loadings P_{ x } either to be orthogonal or to be length restricted (\(diag \left (\mathbf {P}_{x}^{T}\mathbf {P}_{x}\right)=\mathbf {1}\)). The former is the usual constraint in sparse PCA approaches [18]; the latter is more flexible and allows for correlated component loadings. Note that the length constraint is needed to avoid trivial solutions in which very small component weights that satisfy the penalty are compensated by very large loadings. To solve the optimization problem in (7) under these constraints, we rely on a numerical procedure that alternates between the conditional estimation of W given fixed values for P and of P given fixed values for W. For the moment we assume the number of components R and the values of the tuning parameters α, λ_{1}, and λ_{2} to be given; how to tune these meta-parameters is discussed in the next subsection.
The conditional estimation of the weights W is based on a coordinate descent procedure, and that of the loadings on a restricted least-squares routine; both procedures are detailed in the Appendix. Using these routines, the loss is guaranteed to be non-increasing. Furthermore, because the loss is bounded from below by zero, the algorithm converges to a stationary point (for suitable starting values). To deal with the problem of local optima, a multistart procedure can be used. We recommend using a combination of a rational start and several random starting configurations. A rational start may be obtained from the non-sparse principal covariates regression analysis. Note that sparse multivariate approaches are often initialized by a rational start only, which does not account for the problem of local optima.
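The alternating scheme can be illustrated with a simplified sketch for the covariates-only part of the criterion (the α=1 case), substituting a proximal-gradient step for the coordinate-descent update of W. This is our own illustration of the alternation and its monotone loss, not the paper's routine:

```python
import numpy as np

def soft_threshold(A, gamma):
    # Elementwise soft-thresholding: the closed-form lasso update.
    return np.sign(A) * np.maximum(np.abs(A) - gamma, 0.0)

def alternate_sparse_components(X, R, lam1, n_iter=100, seed=0):
    """Alternate a proximal-gradient update of the sparse weights W with an
    orthogonal Procrustes update of the loadings P; the penalized loss
    ||X - XWP^T||^2 + lam1*|W|_1 is non-increasing over iterations."""
    rng = np.random.default_rng(seed)
    J = X.shape[1]
    W = 0.01 * rng.standard_normal((J, R))
    P = np.linalg.qr(rng.standard_normal((J, R)))[0]        # orthonormal start
    L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()             # Lipschitz constant
    losses = []
    for _ in range(n_iter):
        # W-step given P: one proximal-gradient step on the lasso problem.
        grad = -2.0 * X.T @ (X - X @ W @ P.T) @ P
        W = soft_threshold(W - grad / L, lam1 / L)
        # P-step given W: orthogonal Procrustes via the SVD of X^T (XW).
        U, _, Vt = np.linalg.svd(X.T @ (X @ W), full_matrices=False)
        P = U @ Vt
        losses.append(np.sum((X - X @ W @ P.T) ** 2) + lam1 * np.abs(W).sum())
    return W, P, losses
```

Each half-step can only decrease the penalized loss (the P-step leaves the penalty untouched), which is the same monotonicity argument used for the actual algorithm.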
Tuning and model selection
The sparse PCovR model is estimated with fixed values for the weighting parameter α, the number of components R, the lasso tuning parameter λ_{1}, and the ridge parameter λ_{2}. The problem that we consider here is how to tune these meta-parameters. Cross-validation is frequently recommended in the literature, but this requires data that are rich in the number of observations. In addition, the computational cost of cross-validation for the SPCovR model is considerable, because all possible combinations of the values for each of the tuning parameters need to be considered. Furthermore, in the context of PCovR, simulation studies showed that cross-validation is not a superior model selection strategy compared to strategies that rely on a stepwise approach [9]. Hence, we propose to use a stepwise strategy.
with \(\sigma _{\epsilon _{x}}^{2}\) and \(\sigma _{\epsilon _{y}}^{2}\) the variance of the error on the predictor and outcome variables respectively. In the case of a large number of predictor variables, J_{ x } will dominate the expression and we can assume that α will be almost, but not exactly, equal to one without having to estimate the size of the error variances. It is important to keep α strictly smaller than one, for example α=0.99, and thus to use PCovR instead of a PCR approach [19].
Second, we fix the number of components by a so-called scree test, which is based on a visual display of the value of the loss function (3) against the number of components r in the model for r=1,...,R. In this display, we look for the point where the plot levels off and select the number of components just before this point.
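The scree test is a visual procedure, but a simple numerical stand-in (our own heuristic, not part of the paper's procedure) picks the component count after which the drop in loss flattens:

```python
import numpy as np

def scree_elbow(losses):
    """Given the loss values for r = 1, ..., R components, return the r at
    the elbow: the component whose improvement over r-1 is largest relative
    to the next improvement (a simple stand-in for the visual scree test)."""
    drops = -np.diff(losses)                       # improvement per extra component
    ratios = drops[:-1] / np.maximum(drops[1:], 1e-12)
    return int(np.argmax(ratios)) + 2              # elbow sits after the big drop
```

For a loss sequence such as [100, 40, 35, 33, 32], the big improvement when moving to two components followed by small ones points at r = 2.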
Next, we tune the ridge penalty. We recommend setting λ_{2} equal to a small value (for example, 5% of the value of the lasso parameter) so that the emphasis lies on variable selection by the lasso. This small value should be sufficient to stabilize the estimates and to encourage grouping of strongly correlated variables [20].
The final meta-parameter to tune is the lasso parameter λ_{1}. A straightforward and often used procedure to find a proper value for λ_{1} is cross-validation [6]. In the more recent literature it has been established that cross-validation selects a superset of the correct predictors, and thus yields false positives (see for example the retrospective paper on the lasso and the included discussions [21]). One proposal to address this issue of falsely selected variables is the use of a stability selection procedure [22], which allows control of the false positive error rate. Stability selection is a general method that can easily be applied (in adapted form) with our SPCovR procedure. In brief, it is a resampling-based procedure that is used to determine the status of the coefficients (zero or non-zero) over a range of values for the tuning parameter λ_{1}. The original procedure was proposed for a single set of variable coefficients. Here, we have R such sets because we estimate the weights of all components simultaneously. The order of the components between different solutions may change (permutational freedom of the components) and this has to be taken into account.
To explain the stability selection procedure, we start with the for loop in Algorithm 2: given a fixed value λ_{1}, N resamples are created by drawing with replacement a fraction f of the observations (with 0.50≤f<1). The resampled data are subjected to a SPCovR analysis and the resulting N matrices of component weights W are used as follows: for each coefficient w_{ jr }, the proportion of resamples in which it is non-zero is recorded in the probability matrix Π^{(λ)}. Note that between samples, permutation of the components may occur. We account for this by permuting the components to maximal agreement in the component scores as measured by Tucker’s coefficient of congruence; the component score matrix resulting from the SPCovR analysis of the original (non-resampled) data is used as the reference for congruence.
Hence, by fixing E(V), e.g. to one, and the probability threshold π_{ thr }=0.90 [22], an upper bound on the number of non-zero coefficients is obtained. For a range Λ of λ_{1} values, the non-zero probability is given by \(\Pi^{(Stable)}=\max_{\lambda\in\Lambda}\Pi^{(\lambda)}\) and the set of non-zero coefficients by those for which \(\pi _{jr}^{(Stable)}\geq \pi _{thr}\). If q_{ Λ }≤q_{ R }, the procedure continues by extending the range of λ_{1} values with the next value. The values of λ_{1} are taken from the interval Λ=[λ_{ max },...,λ_{ min }] with λ_{ min }=10^{−4}λ_{ max } and the remaining values equally spaced on a log_{2} scale between log_{2}(λ_{ max }) and log_{2}(λ_{ min }), arranged in decreasing order; see [23].
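The resampling loop of the stability selection procedure can be sketched for a single set of coefficients (the component-permutation alignment and the E(V)-based bound described above are omitted for brevity). Here `fit` stands in for a sparse fitting routine and all names are ours:

```python
import numpy as np

def stability_selection(X, y, fit, lambdas, n_resamples=100, frac=0.5,
                        pi_thr=0.90, seed=0):
    """Skeleton of stability selection: record, per lambda, the proportion of
    resamples in which each coefficient is non-zero (the matrix Pi), take the
    elementwise maximum over the lambda grid, and threshold at pi_thr."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    n_sub = int(frac * I)
    pi = np.zeros((len(lambdas), J))
    for k, lam in enumerate(lambdas):
        for _ in range(n_resamples):
            idx = rng.choice(I, size=n_sub, replace=True)  # resample observations
            w = fit(X[idx], y[idx], lam)                   # sparse coefficient vector
            pi[k] += (w != 0)
        pi[k] /= n_resamples
    pi_stable = pi.max(axis=0)          # Pi^(Stable): max over the lambda grid
    return np.flatnonzero(pi_stable >= pi_thr)
```

Any sparse solver that returns exact zeros can be plugged in as `fit`; in the SPCovR case, it would be the SPCovR analysis of the resampled data.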
Results
To compare the performance of sparse principal covariates regression with competing methods, we make use of both synthetically created data in a simulation study and empirical data resulting from a systems biology study on the flu vaccine.
Simulation study
- 1
VAFX: The total proportion of variation accounted for (VAF) by the components in the block of covariates with levels 0.01, 0.40, and 0.70.
- 2
The relative strength of the components in the variation accounted for in the block of covariates VAFX: 0.10 versus 0.90 (the second component is much stronger than the first component; for example with VAFX =0.40 the first component accounts for 4 percent of the total variation and the second one for 36 percent), 0.50 versus 0.50 (equally strong), and 0.90 versus 0.10.
- 3
VAFY: The total proportion of variation accounted for by the components in the outcome with levels 0.02, 0.50, and 0.80.
All factors were crossed, resulting in a simulation experiment with 3×3×3=27 conditions. The number of observations and variables was fixed to I=100 and J=200 respectively, 80% of the component weight coefficients were set equal to zero (that is, 320 of the 400 coefficients in total), and the regression weights for the first and second component were set equal to b_{1}=1 and b_{2}=−0.02, implying that the first component is much more strongly associated with the outcome than the second one (for equally strong components).
We expect SPCR to perform well, in terms of recovering the components, in all conditions where the components account for a considerable amount of variation in the covariates (VAFX = 0.40/0.70) but not when the components are submerged in the noise (VAFX = 0.01). In terms of prediction, SPCR can be expected to perform well when the components account for variation not only in the covariates but also in the outcome; when VAFY = 0.02, predictive performance can be expected to be poor for any method, including SPCR. For SPLS, we expect good performance when the components account for considerable variation both in the covariates and the outcome (VAFX = 0.40/0.70 and VAFY = 0.50/0.80) but poor performance when either VAFX or VAFY is low. Lastly, we expect SPCovR to perform well in terms of recovering the components when either VAFX or VAFY is considerable but not when both are low (VAFX = 0.01 and VAFY = 0.02). In terms of prediction, the performance of SPCovR will be poor when VAFY = 0.02.
Note that the initial matrix X^{(0)} was not generated under a SPCovR model. To obtain data that perfectly fit such a model, a principal covariates regression analysis with fixed zero weights was performed to yield sparse component weights W and orthogonal loadings P. Again, the weights were rescaled, and a block of covariates X^{TRUE}=X^{(0)}WP^{T} and the outcome y^{TRUE}=X^{TRUE}W[b_{1} b_{2}]^{T} were calculated on the basis of the scaled component weights and the loadings resulting from the SPCovR analysis. These are data with no noise; in a final step, noise was added by sampling from the normal distribution with mean zero and variance set in correspondence to the level of the proportion of variation accounted for by the components in the covariates and the outcome, yielding the data X and y. For each of the 27 conditions, 20 replicate data sets were generated, resulting in 540 data sets in total. The code used to generate the data, including the seed used to initialize the pseudo-random number generator, can be found on GitHub: https://github.com/katrijnvandeun/SPCovR.
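The noise-addition step can be sketched as follows: the noise is scaled so that the noise-free part accounts for exactly the targeted proportion of the total sum of squares. This is our own minimal version; the actual generation code is on GitHub:

```python
import numpy as np

def add_noise(true_data, vaf, rng):
    """Add i.i.d. normal noise scaled so that the noise-free data account for
    a proportion `vaf` of the total sum of squares (signal + noise)."""
    signal_ss = np.sum(true_data ** 2)
    noise = rng.standard_normal(true_data.shape)
    noise *= np.sqrt(signal_ss * (1 - vaf) / (vaf * np.sum(noise ** 2)))
    return true_data + noise
```

For example, `add_noise(X_true, 0.40, rng)` yields data in which the structural part accounts for 40% of the total sum of squares.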
Each dataset was subjected to five analyses: sparse principal components regression (SPCR), sparse partial least squares (SPLS), and three SPCovR analyses with different values of α, namely 0.01, 0.50, and 0.99. For the SPCR analysis, we used the elasticnet R package that implements the sparse PCA approach of [18]. The elasticnet R package uses a least angle regression (LARS) [17] procedure and hence allows finding a solution with a defined number of zeros for each of the components. We set this number equal to the exact number of zero coefficients occurring in W. For the SPLS analysis, the R package RGCCA was used, which allows one to (approximately) set the total number of zero coefficients over the components; the analysis options were set to result in (approximately) 320 zero coefficients. For the SPCovR analyses, we used stability selection with the upper bound on the number of non-zero coefficients q set equal to 80. Hence all analyses were tuned such that they had exactly or approximately the same number of zeros as present in the true underlying component weight matrix. This makes the interpretation of the results easier in the sense that the performance of the methods does not depend on proper tuning of the sparseness penalty. In the comparison of the methods, we will consider their performance in recovering the underlying components and how well they predict a new set of test data (generated under the same model, that is, with the same W, regression weights, and the same error level for the covariates and outcome).
\(\phi = \frac{\text{vec}(\mathbf{T})^{T}\text{vec}(\hat{\mathbf{T}})}{\|\text{vec}(\mathbf{T})\|\,\|\text{vec}(\hat{\mathbf{T}})\|};\)
this is the cosine of the angle between the vectors vec(T) and \(\text {vec} (\hat {\mathbf {T}})\), with higher values indicating more similarity between the components. Values over 0.95 indicate that the components can be considered equal, while values in the range [0.85, 0.94] correspond to a fair similarity [24]. In Fig. 1 the Tucker congruence of the 20 replicate data sets is shown for the 27 conditions and the three methods (SPCovR with α=0.99, SPCR, and SPLS). For each combination of the variation accounted for in the covariates and in the outcome (e.g., VAFX = 0.01 and VAFY = 0.02 at the left of the left panel), three boxplots are shown for the three methods. These correspond to the three levels of the relative strength factor, with the boxplots at the left referring to the conditions where the first component is weaker than the second one, the boxplots in the middle referring to the conditions where they are equally strong, and the boxplots at the right to the conditions where the second component is stronger than the first one. SPCovR outperforms the two other methods in all conditions, followed by SPCR, which outperforms SPLS in most conditions. Only when the variance accounted for by the components in the block of covariates is very low (the conditions VAFX = 0.01) while it is high in the outcome variable (VAFY = 0.50/0.80) does SPLS outperform SPCR, by taking advantage of the information included in the outcome. SPCovR, in all conditions, takes advantage of putting some weight on modeling the outcome in the construction of the components.
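The congruence measure itself is a one-liner (the function name is ours):

```python
import numpy as np

def tucker_congruence(T, T_hat):
    """Tucker's coefficient of congruence: the cosine of the angle between
    the vectorized component-score matrices vec(T) and vec(T_hat)."""
    t, th = T.ravel(), T_hat.ravel()
    return float(t @ th / (np.linalg.norm(t) * np.linalg.norm(th)))
```

Note that the coefficient is scale-invariant: rescaling the recovered components leaves it unchanged, which is why it measures similarity of the configuration rather than of the magnitudes.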
SPCovR was run with three levels of the weighting parameter α: α=0.01, α=0.50, and α=0.99. For both performance measures and in all conditions, SPCovR with α=0.99 yields the best results. Hence, it seems that little weight should be given to fitting the outcome in order to obtain good results both in terms of recovering the components and in terms of predicting the outcome. Note that giving no weight at all to the outcome in modeling the components, that is, a SPCR analysis, leads to worse recovery in general. For prediction, on the other hand, the gain of using SPCovR is limited to a few conditions and, when the noise in the covariates is considerable, SPCovR is prone to overfitting while SPCR is not.
Systems biology study of the flu vaccine
We will illustrate SPCovR and compare it with SPLS using data that result from a systems biology study on vaccination against influenza [2]. The general aim of this study was to predict vaccine efficacy with microarray gene expression data obtained soon after vaccination and to gain insight into the underlying biological mechanisms. First we give a general description of the data and how they were pre-processed; then we discuss the SPCovR and SPLS analyses and results.
The authors made the data for two seasons, 2007 and 2008, publicly available on the NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) with accession numbers GSE29614 and GSE29617. For both seasons, a microarray analysis was performed on the genomewide expression in peripheral blood mononuclear cells collected just before and 3 days after vaccination for all participants (26 in 2008 and 9 in 2007). Two different array platforms for measuring gene expression were used, but the first 54,675 of the 54,715 probe sets of the 2008 season are shared with the 2007 season. Hence, we can use the 2007 data as an independent test sample. Note that the choice of the 2007 data as the test set is motivated by its extremely small sample size. The RMA algorithm (Robust Multichip Average; see Irizarry et al. 2003) was used to pre-process the CEL files. The data collected just before vaccination were considered as the baseline and subtracted from the data at day 3. For each variable (probe set), the difference scores were centered and scaled to a sum of squares equal to one. These scaled difference scores form the set of predictor scores X in the SPCovR and sparse PLS analyses.
To assess the efficacy of the vaccine, three types of plasma hemagglutination inhibition (HAI) antibody titers were assessed just before and 28 days after vaccination. As described by [2], vaccine efficacy was measured by subtracting the log-transformed antibody titers at baseline from the log-transformed antibody titers 28 days later and taking the maximum of these three baseline-corrected outcomes (to reduce the influence of subjects who started with high antibody concentrations due to previous infection). These maximal change scores were centered, resulting in the scores used as the outcome variable y in the SPCovR and sparse PLS analyses.
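The construction of y from the raw titers can be sketched as follows (the function and argument names are hypothetical; the actual pre-processing scripts are on GitHub):

```python
import numpy as np

def hai_outcome(titers_day28, titers_day0):
    """Maximal baseline-corrected log titer per subject, centered:
    for each subject, take max over the three HAI titer types of
    log(titer at day 28) - log(titer at baseline), then center."""
    change = np.log(titers_day28) - np.log(titers_day0)  # I x 3 log changes
    y = change.max(axis=1)                               # max over titer types
    return y - y.mean()                                  # centered outcome
```

Taking the maximum per subject before centering implements the baseline correction described above.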
Next we turn to the analyses with imposed sparseness on the component weights. The meta-parameters of the SPCovR model were set using the proposed stepwise model selection procedure. Hence, based on Fig. 3, a model with two components was selected, α was set to a value close to one (α=0.99), and the ridge penalty was set equal to 0.05λ_{1}. In the stability selection procedure, we used N=500 resamples, the threshold π_{ thr } was set equal to 0.90, and E(V)=1. This results in q_{ R }=416. We compare with the sparse PLS results from two R packages, RGCCA [11] and spls [10], also using R=2 components and tuned such that approximately 416 non-zero component weights were obtained, in order to have similar sparseness of the sparse PCovR and sparse PLS solutions. spls [10] uses a univariate soft-thresholding approach, that is, λ_{2}→∞. The SGCCA function in RGCCA [11] was used with the default option for tuning the ridge penalty.
Fit of modeled to observed data for three methods: SPCovR, spls, and SGCCA
Method | VAF | \(r(\hat {y},y)^{2}\) | \(r(\hat {y}_{2007},y_{2007})^{2}\)
---|---|---|---
SPCovR | 0.19 | 0.42 | 0.79
spls | | 0.99 | 0.55
SGCCA | 0.11 | 1 | 0.53
Percentage of variance accounted for in the block of covariates (VAFX) and in the outcome \(\left (r(y,\hat {y})^{2}\right)\) by each of the SPCovR and SGCCA components
 | | Component 1 | Component 2
---|---|---|---
SPCovR | VAFX | 0.10 | 0.08
 | \(r(y,\hat {y})^{2}\) | 0.01 | 0.40
SGCCA | VAFX | 0.07 | 0.04
 | \(r(y,\hat {y})^{2}\) | 0.79 | 0.20
Another criterion that is important when comparing the different solutions is related to the interpretation of the solution: do the components reflect a common biological theme that gives insight into the mechanisms that underlie vaccines? To answer this question, a functional annotation based on the strength of association of the genes with the components can be performed. The SPCovR and SGCCA results contain such information in two ways, namely in the component weights and in the loadings. The component weights reflect those genes (probe sets) that have the strongest contribution to the component scores. Because of the sparseness restriction, only a few of them are non-zero. The loadings, on the other hand, reflect the strength of association between the expression values of a particular gene and the component scores. Whereas the component weights measure the unique contribution of a gene on top of the other genes, the loadings measure the strength of association without taking the other genes into account (comparable to the interpretation of partial versus univariate correlations); see also [25].
Significantly enriched gene ontology classes
Biological process | Nr of genes found | Nr of genes expected | +/− | P-value |
---|---|---|---|---|
rRNA methylation | 5 | 0.21 | + | 2.03E−02 |
Cellular macromolecule metabolic process | 89 | 58.65 | + | 1.68E−02 |
Nucleic acid metabolic process | 60 | 34.14 | + | 2.84E−02 |
Cellular component organization or biogenesis | 75 | 47.15 | + | 3.86E−02 |
Gene expression | 57 | 31.88 | + | 3.30E−02 |
Leukocyte activation | 18 | 4.59 | + | 6.11E−03 |
Cell activation | 20 | 5.36 | + | 2.88E−04 |
Immune system process | 31 | 13.13 | + | 2.79E−02 |
Immune effector process | 19 | 5.25 | + | 9.41E−03 |
Negative regulation of metabolic process | 32 | 14.16 | + | 4.59E−02 |
Second, we performed an enrichment analysis based on the loadings. The loadings reflect the strength of association of a gene with the component, with higher loadings indicating that the gene is more important for the process at play. Both PANTHER and GSEA [29] accept as input lists of genes together with a value that indicates the importance of the gene^{2}. Output resulting from the enrichment analyses can be found online (https://github.com/katrijnvandeun/SPCovR); here we summarize the main results. A first result of interest is that the same kind of processes are recovered from the enrichment analyses of the loadings as obtained previously when looking for over-represented classes in the gene lists with non-zero component weights. Also here, the annotation of the loadings obtained with SPCovR shows evidence of immune-related processes, while such evidence is weak for the SGCCA loadings. Notably, some immune-related gene ontology terms are found in the enrichment analyses of the SGCCA loadings. In fact, overall more terms are recovered from the enrichment analyses of the loadings. This could be expected given the small number of genes in the lists obtained from the non-zero component weights.
Taken together, the results suggest that SPCovR, by putting more emphasis on accounting for the structural variation in the gene expression data when building the prediction model, catches the processes that are important in establishing the immune response to the vaccine. This pays off in the sense that a more stable prediction model is obtained that has better generalizability (and thus better prediction for the held out sample).
Discussion
Often, a large number of variables is measured with a double goal: predicting an outcome of interest and gaining insight into the mechanisms that relate the predictor variables to the outcome. In the high-dimensional setting this comes with a variable selection problem. Principal covariates regression is a promising tool for reaching this double goal; we extended it to the high-dimensional setting by introducing a sparse version of the PCovR model and offering a flexible and efficient estimation procedure.
In this paper we showed through simulation that sparse PCovR can outperform sparse PLS because it allows less emphasis to be placed on modeling the outcome: by putting more weight on accounting for the variation in the covariates, more insight into the processes that underlie the data may be obtained, and this, in turn, results in better out-of-sample prediction. This benefit was illustrated on publicly available data: a clearly meaningful annotation of the selected genes was obtained with SPCovR, while no enriched terms were found for the genes selected by sparse PLS. At the same time, the SPCovR analysis resulted in much better out-of-sample prediction.
Appendix
Derivation of an algorithm for sparse PCovR
Here we will discuss the estimation of the loadings and component weights in the alternating procedure presented in Algorithm 1.
Conditional estimation of the loadings
Conditional upon fixed component weights W (and hence fixed component scores T=ZW), the loadings are obtained by minimizing \(\|\mathbf {Z}-\mathbf {T}\mathbf {P}^{T}\|^{2}+k\) over P such that diag(P^{ T }P)=1 (oblique case) or P^{ T }P= I (orthogonal case), with \(k=\lambda _{1}|\mathbf {W}|_{1} +\lambda _{2}|\mathbf {W}|_{2}^{2}\) a constant.
The solution to the orthogonal case is given by P=VU^{ T }, with U and V obtained from the singular value decomposition of T^{ T }Z. When the number of variables is much larger than the number of observations, a more efficient procedure is to calculate the eigenvalue decomposition of the R×R matrix T^{ T }ZZ^{ T }T and to use the resulting eigenvectors and eigenvalues to obtain VU^{ T } (see the implementation for details).
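Both routes to P=VU^{ T } can be sketched as follows (a minimal NumPy illustration with hypothetical variable names; the actual implementation is in the repository linked above):

```python
import numpy as np

def procrustes_loadings(Z, T):
    """Orthogonal case: P minimizing ||Z - T P^T||^2 subject to P^T P = I,
    via the SVD of the R x J matrix T^T Z."""
    U, _, Vt = np.linalg.svd(T.T @ Z, full_matrices=False)
    return Vt.T @ U.T                          # P = V U^T, a J x R matrix

def procrustes_loadings_fast(Z, T):
    """Same solution via the eigendecomposition of the small R x R
    matrix T^T Z Z^T T, avoiding an SVD over all J variables."""
    M = T.T @ Z                                # R x J cross-product
    evals, U = np.linalg.eigh(M @ M.T)         # ascending eigenvalues
    idx = np.argsort(evals)[::-1]              # reorder descending
    U, evals = U[:, idx], evals[idx]
    Vt = (U.T @ M) / np.sqrt(evals)[:, None]   # recover V^T = S^{-1} U^T M
    return Vt.T @ U.T

# Small demonstration with random data (I = 20 observations, J = 50
# variables, R = 2 components; T taken orthonormal for simplicity).
rng = np.random.default_rng(1)
Z = rng.standard_normal((20, 50))
T, _ = np.linalg.qr(rng.standard_normal((20, 2)))
P1, P2 = procrustes_loadings(Z, T), procrustes_loadings_fast(Z, T)
```

The two routes return the same P; the second only decomposes an R×R matrix, which is cheap when R is small relative to J.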
Conditional estimation of the component weights
The procedure may seem slow given the loop over RJ coefficients and the large matrices involved in the computations. We addressed these computational issues in our implementation by rewriting the expressions in (21) and exploiting the properties of P. A further speed-up may be obtained by warm restarts and active-set learning, that is, cycling over the non-zero coefficients only. An additional improvement over the GLMnet procedure [20] was obtained by accounting for its dependence on the order of the variables, as described in [31].
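To illustrate the kind of coordinate-wise update involved, the sketch below is a generic elastic-net coordinate descent on hypothetical inputs, not the SPCovR-specific update (21), which additionally involves the loadings P and the weighting parameter:

```python
import numpy as np

def soft_threshold(x, t):
    """Lasso proximal operator: shrink x towards zero by t."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=500, tol=1e-10):
    """Generic coordinate descent for
    0.5 ||y - X b||^2 + lam1 |b|_1 + lam2 |b|_2^2,
    cycling over one coefficient at a time as in [20, 23]."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()                             # current residual y - X b
    for _ in range(n_sweeps):
        b_prev = b.copy()
        for j in range(p):
            r += X[:, j] * b[j]              # put coefficient j back
            rho = X[:, j] @ r                # partial covariance with residual
            b[j] = soft_threshold(rho, lam1) / (col_sq[j] + 2 * lam2)
            r -= X[:, j] * b[j]              # subtract updated contribution
        if np.max(np.abs(b - b_prev)) < tol:
            break                            # converged
    return b

# Demonstration: with lam1 = lam2 = 0 the loop reduces to least squares.
rng = np.random.default_rng(2)
X = rng.standard_normal((60, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.01 * rng.standard_normal(60)
b_cd = elastic_net_cd(X, y, lam1=0.0, lam2=0.0)
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

With lam1 > 0 the same loop yields sparse coefficients; warm restarts amount to initializing b at the solution for a nearby penalty value, and active-set learning restricts the inner loop to the currently non-zero coefficients.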
Footnotes
- 1.
[30] proposed a so-called sparse principal components regression method that is in fact a sparse covariates regression method. As this method, implemented in the spcr R package, produced an out-of-memory failure on the illustrative example, we do not consider it further.
- 2.
The reason to consider GSEA in addition to PANTHER is that the latter allowed only a very limited set of gene ontology terms to be used in the enrichment analysis, unlike in the overrepresentation analysis.
Notes
Acknowledgements
We thank the anonymous reviewers for providing us with helpful comments on earlier drafts of the manuscript.
Funding
This research was funded by a personal grant from the Netherlands Organisation for Scientific Research [NWO-VIDI 452.16.012] awarded to Katrijn Van Deun. The funder did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Availability of data and materials
The source code implementing our proposed method is available from GitHub, together with all scripts used to extract, pre-process, analyze, and post-process the data: https://github.com/katrijnvandeun/SPCovR.
Authors’ contributions
KVD and EC: conception and design of the study. KVD and EAVC: analysis and interpretation of the empirical data. KVD: algorithm development and implementation, drafted the manuscript. All authors critically read and approved the manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson J, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO. The transcriptional program in the response of human fibroblasts to serum. Science. 1999; 283(5398):83–7. https://doi.org/10.1126/science.283.5398.83.
- 2.Nakaya HI, Wrammert J, Lee EK, Racioppi L, Marie-Kunze S, et al. Systems biology of vaccination for seasonal influenza in humans. Nat Immunol. 2011; 12(8):786–95. https://doi.org/10.1038/ni.2067.
- 3.Jolliffe IT. Principal Component Analysis, 2nd ed. Springer Series in Statistics. New York: Springer; 2002.
- 4.Hadi AS, Ling RL. Some cautionary notes on the use of principal components regression. Am Stat. 1998; 52(1):15–19. https://doi.org/10.1080/00031305.1998.10480530.
- 5.Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B. 1996; 58:267–88.
- 6.Zou H, Hastie T. Regularization and variable selection via the elastic net. J Royal Stat Soc Ser B Stat Methodol. 2005; 67(2):301–20.
- 7.Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform. 2007; 8(1):32–44. https://doi.org/10.1093/bib/bbl016.
- 8.de Jong S, Kiers HAL. Principal covariates regression: Part I. Theory. Chemometr Intell Lab Syst. 1992; 14(1–3):155–64.
- 9.Vervloet M, Van Deun K, Van den Noortgate W, Ceulemans E. Model selection in principal covariates regression. Chemometr Intell Lab Syst. 2016; 151:26–33. https://doi.org/10.1016/j.chemolab.2015.12.004.
- 10.Chun H, Keleş S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J Royal Stat Soc Ser B Stat Methodol. 2010; 72(1):3–25.
- 11.Tenenhaus A, Philippe C, Guillemot V, Le Cao KA, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics. 2014; 15(3):569–83. https://doi.org/10.1093/biostatistics/kxu001.
- 12.Kiers HAL. Towards a standardized notation and terminology in multiway analysis. J Chemometr. 2000; 14:105–22.
- 13.Heij C, Groenen PJ, van Dijk D. Forecast comparison of principal component regression and principal covariate regression. Comput Stat Data Anal. 2007; 51(7):3612–25.
- 14.Stone M, Brooks RJ. Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal component regression (with discussion). J Royal Stat Soc Ser B. 1990; 52:237–69.
- 15.Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006; 101(476):1418–29. https://doi.org/10.1198/016214506000000735.
- 16.Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010; 20:101–48.
- 17.Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat. 2004; 32(2):407–51.
- 18.Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J Comput Graph Stat. 2006; 15:265–86.
- 19.Vervloet M, Van Deun K, Van den Noortgate W, Ceulemans E. On the selection of the weighting parameter value in principal covariates regression. Chemometr Intell Lab Syst. 2013; 123:36–43.
- 20.Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33(1):1–22.
- 21.Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J Royal Stat Soc Ser B Stat Methodol. 2011; 73(3):273–82. https://doi.org/10.1111/j.1467-9868.2011.00771.x.
- 22.Meinshausen N, Bühlmann P. Stability selection. J Royal Stat Soc Ser B Stat Methodol. 2010; 72(4):417–73. https://doi.org/10.1111/j.1467-9868.2010.00740.x.
- 23.Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Ann Appl Stat. 2007; 1(2):302–32.
- 24.Lorenzo-Seva U, ten Berge JMF. Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodology. 2006; 2(2):57–64.
- 25.Van Deun K, Wilderjans TF, van den Berg RA, Antoniadis A, Van Mechelen I. A flexible framework for sparse simultaneous component based data integration. BMC Bioinformatics. 2011; 12:448. https://doi.org/10.1186/1471-2105-12-448.
- 26.Mi H, Huang X, Muruganujan A, Tang H, Mills C, Kang D, Thomas PD. PANTHER version 11: expanded annotation data from gene ontology and reactome pathways, and data analysis tool enhancements. Nucleic Acids Res. 2017; 45(D1):D183–9. https://doi.org/10.1093/nar/gkw1138.
- 27.Halstead ES, Chroneos ZC. Lethal influenza infection: Is a macrophage to blame? Expert Rev Anti-Infect Ther. 2015; 13(12):1425–28. https://doi.org/10.1586/14787210.2015.1094375. PMID: 26414622.
- 28.Mock T. Identification and characterisation of protein phosphatase 1, catalytic subunit alpha (PP1α) as a regulator of NF-κB in T lymphocytes. Unpublished doctoral dissertation; 2012. http://www.ub.uni-heidelberg.de/archiv/13079.
- 29.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005; 102(43):15545–50. https://doi.org/10.1073/pnas.0506580102.
- 30.Kawano S, Fujisawa H, Takada T, Shiroishi T. Sparse principal component regression with adaptive loading. Comput Stat Data Anal. 2015; 89:192–203. https://doi.org/10.1016/j.csda.2015.03.016.
- 31.Yuan GX, Ho CH, Lin CJ. An improved GLMNET for L1-regularized logistic regression. J Mach Learn Res. 2012; 13:1999–2030.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.