Abstract
We address the issue of variable preselection in high-dimensional penalized regression, such as the lasso, a commonly used approach to variable selection and prediction in genomics. Preselection, which starts the analysis from a manageable set of covariates, is becoming increasingly necessary to enable advanced analysis of the huge data sets created by high-throughput technologies. Preselecting the features to include in multivariate analyses on the basis of simple univariate ranking is a natural strategy that has often been implemented despite its potential bias. We demonstrate this bias and propose a way to correct it. Starting with a sequential implementation of the lasso with increasing lists of predictors, we exploit a property of the corresponding set of cross-validation curves, a pattern that we call "freezing". The ranking of the predictors to be included sequentially is based on simple measures of association with the outcome, which can be precomputed efficiently for ultra-high-dimensional data sets, externally to the penalized regression implementation. We demonstrate by simulation that, in the vast majority of cases, our sequential approach offers a safe and efficient way of focusing the lasso analysis on a smaller, manageable number of predictors. In situations where the lasso performs well, we typically need less than 20 % of the variables to recover the same solution as with the full set of variables. We illustrate the applicability of our strategy in the context of a genome-wide association study and on microarray genomic data, where we need just 2.5 % and 13 % of the variables respectively. Finally we include an example in which 260 million gene-gene interactions are ranked, and we are able to recover the lasso solution using only 1 % of these. Freezing offers great potential for extending the applicability of penalized regression to current and upcoming ultra-high-dimensional problems in bioinformatics. Its applicability is not limited to the standard lasso but is a generic property of many penalized approaches.
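The univariate ranking that drives the preselection can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the toy data and the function name `rank_by_marginal_correlation` are invented for the example. The point is that the ranking is a single pass of matrix arithmetic, so it can be run on ultra-high-dimensional data outside any penalized-regression code.

```python
import numpy as np

def rank_by_marginal_correlation(X, y):
    """Rank the columns of X by absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)                 # centre each covariate
    yc = y - y.mean()
    num = Xc.T @ yc                         # <x_j, y> for every j at once
    den = np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    return np.argsort(-np.abs(num / den))   # column indices, strongest first

# toy data: 3 true signals hidden among 2000 covariates
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))
beta = np.zeros(2000)
beta[:3] = 3.0
y = X @ beta + rng.standard_normal(100)

order = rank_by_marginal_correlation(X, y)
C_p = order[:100]   # first preselected set, to be grown sequentially
```

In a sequential analysis, the next preselected sets are simply longer prefixes of `order`, so the ranking is computed once and reused.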
Ingrid K. Glad and Sylvia Richardson contributed equally.
Acknowledgements
This research was supported by grant number 204664 from the Norwegian Research Council (NRC) and by Statistics for Innovation (sfi)², a centre for research-based innovation funded by NRC. SR and LCB spent a research period in Paris at Inserm UMRS937, and SR has an adjunct position at (sfi)². IA was funded by a grant from the Agence Nationale de la Recherche (ANR Maladies neurologiques et maladies psychiatriques) as part of a project on the relation between Parkinson's disease and genes involved in the metabolism and transport of xenobiotics (PI: Alexis Elbaz, Inserm), for which access to GWAS data was obtained through dbGaP; this work used in part data from the NINDS dbGaP database from the CIDR:NGRC PARKINSONS DISEASE STUDY (Accession: phs000196.v2.p1). Sjur Reppe at Ullevaal University Hospital provided the bone biopsy data.
Appendices
Appendix 1
Proof 1 (Proof of (3a) and (3b))
Fix \(\lambda\), and drop it from the notation for simplicity. Let
\[
f_{C_{p}}(\boldsymbol{\beta }_{C_{p}}) =\sum _{i=1}^{n}\Big(y_{i} -\sum _{j\in C_{p}}\beta _{j}x_{ij}\Big)^{2} +\lambda \sum _{j\in C_{p}}\vert \beta _{j}\vert,
\]
and similarly for \(f_{C_{F}}(\boldsymbol{\beta }_{C_{F}})\). Given \(\boldsymbol{\beta }_{C_{p}}\), we can form the vector in \(\mathbb{R}^{P}\) with \(\vert C_{F}\setminus C_{p}\vert \) zeros as \((\boldsymbol{\beta }_{C_{p}},\boldsymbol{\beta }_{C_{F}\setminus C_{p}} =\boldsymbol{ 0})\). For such a vector it holds that
\[
f_{C_{F}}\big((\boldsymbol{\beta }_{C_{p}},\boldsymbol{0})\big) = f_{C_{p}}(\boldsymbol{\beta }_{C_{p}}).\tag{6}
\]
Next we show that the nonzero components of
\[
\hat{\boldsymbol{\beta }}_{C_{p}} =\mathop{\mathrm{argmin}}_{\boldsymbol{\beta }_{C_{p}}}f_{C_{p}}(\boldsymbol{\beta }_{C_{p}})
\]
are the same as the nonzero components of
\[
\hat{\boldsymbol{\beta }}_{C_{F}} =\mathop{\mathrm{argmin}}_{\boldsymbol{\beta }_{C_{F}}}f_{C_{F}}(\boldsymbol{\beta }_{C_{F}})
\]
when \(S_{F} \subseteq C_{p}\). In fact, we first have
\[
f_{C_{p}}(\hat{\boldsymbol{\beta }}_{C_{p}}) =\min _{\boldsymbol{\beta }_{C_{p}}}f_{C_{p}}(\boldsymbol{\beta }_{C_{p}}).
\]
Now we add some zero coefficients, such that
\[
f_{C_{p}}(\hat{\boldsymbol{\beta }}_{C_{p}}) = f_{C_{F}}\big((\hat{\boldsymbol{\beta }}_{C_{p}},\boldsymbol{0})\big)
\]
by (6). Hence
\[
(\hat{\boldsymbol{\beta }}_{C_{p}},\boldsymbol{0}) =\mathop{\mathrm{argmin}}_{\boldsymbol{\beta }_{C_{F}}:\,\boldsymbol{\beta }_{C_{F}\setminus C_{p}}=\boldsymbol{0}}f_{C_{F}}(\boldsymbol{\beta }_{C_{F}}).\tag{7}
\]
When we minimize \(f_{C_{F}}(\boldsymbol{\beta }_{C_{F}})\) without constraints, we know that, because \(S_{F} \subseteq C_{p}\), the solution satisfies \(\hat{\boldsymbol{\beta }}_{C_{F}\setminus C_{p}} =\boldsymbol{ 0}\). Hence we can drop the constraint \(\boldsymbol{\beta }_{C_{F}\setminus C_{p}} =\boldsymbol{ 0}\) in (7) and minimize over \(\boldsymbol{\beta }_{C_{F}\setminus C_{p}}\) as well, without making any difference. We obtain that the nonzero components of \(\hat{\boldsymbol{\beta }}_{C_{p}}\) are the same as the nonzero components of \(\hat{\boldsymbol{\beta }}_{C_{F}}\). Let
\[
S_{F} =\{ j:\hat{\beta }_{j,C_{F}}\neq 0\}.
\]
Then for \(j \in S_{F}\), \(\hat{\beta }_{j,C_{F}}\neq 0\). Therefore, since the nonzero components of \(\hat{\boldsymbol{\beta }}_{C_{F}}\) are the same as those of \(\hat{\boldsymbol{\beta }}_{C_{p}}\) when \(S_{F} \subseteq C_{p}\), also \(\hat{\beta }_{j,C_{p}}\neq 0\). The opposite is also true: if \(\hat{\beta }_{j,C_{p}}\neq 0\), then \(j \in S_{F}\) and \(\hat{\beta }_{j,C_{F}}\neq 0\). Similarly for \(j\notin S_{F}\). This proves that
(a) \(S_{p}(\lambda ) = S_{F}(\lambda )\quad \forall p \geq p_{0}(\lambda );\)

(b) \(\hat{\beta }_{j,C_{p}}(\lambda ) =\hat{\beta }_{j,C_{F}}(\lambda )\quad \forall p \geq p_{0}(\lambda ),\ \forall j.\)
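Statements (3a) and (3b) can be checked numerically. Below is a small sketch using scikit-learn's coordinate-descent `Lasso` on invented toy data (an illustration, not the paper's code): solving the lasso on any preselected set \(C_p\) that contains the full active set \(S_F\) recovers the full solution.

```python
import numpy as np
from sklearn.linear_model import Lasso

# toy data with three true signals
rng = np.random.default_rng(1)
n, P = 80, 40
X = rng.standard_normal((n, P))
beta = np.zeros(P)
beta[[0, 3, 7]] = [2.0, -1.5, 1.0]
y = X @ beta + 0.5 * rng.standard_normal(n)

def lasso_coef(Z, lam):
    """Lasso coefficients at a fixed penalty (intercept-free for simplicity)."""
    return Lasso(alpha=lam, fit_intercept=False,
                 tol=1e-12, max_iter=500_000).fit(Z, y).coef_

lam = 0.1
coef_full = lasso_coef(X, lam)           # lasso on the full set C_F
S_F = np.flatnonzero(coef_full)          # active set of the full solution

# any preselected set C_p that contains S_F ...
C_p = np.union1d(S_F, np.arange(15))
coef_sub = lasso_coef(X[:, C_p], lam)    # lasso restricted to C_p

# ... yields the same solution: (3a) same active set, (3b) same coefficients
padded = np.zeros(P)
padded[C_p] = coef_sub

def support(c):
    return np.abs(c) > 1e-6

same_support = bool(np.array_equal(support(padded), support(coef_full)))
max_diff = float(np.abs(padded - coef_full).max())
```

Note that scikit-learn scales the squared-error term by \(1/(2n)\); the restricted-versus-full argument above is unaffected by this rescaling of the objective.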
Proof 2 (Proof of (4) and (5))
For fixed \(\lambda\) we have that if
\[
p \geq p_{0,k}(\lambda )\quad \text{for all folds }k,
\]
then
\[
\hat{y}_{i,C_{p}}^{-k}(\lambda ) =\hat{ y}_{i,C_{F}}^{-k}(\lambda )\quad \forall i,\ \forall k.
\]
By (3a) and (3b) it follows that for all \(p_{2} > p_{1} \geq p_{0,k}(\lambda )\) and \(\forall k\)
\[
S_{p_{1}}^{-k}(\lambda ) = S_{p_{2}}^{-k}(\lambda ) = S_{F}^{-k}(\lambda )\tag{8}
\]
and
\[
\hat{\beta }_{j,C_{p_{1}}}^{-k}(\lambda ) =\hat{\beta }_{j,C_{p_{2}}}^{-k}(\lambda ) =\hat{\beta }_{j,C_{F}}^{-k}(\lambda )\quad \forall j.\tag{9}
\]
Then
\[
\hat{y}_{i,C_{p_{1}}}^{-k}(\lambda ) =\sum _{j\in S_{p_{1}}^{-k}(\lambda )}\hat{\beta }_{j,C_{p_{1}}}^{-k}(\lambda )x_{ij} +\sum _{j\in C_{p_{1}}\setminus S_{p_{1}}^{-k}(\lambda )}\hat{\beta }_{j,C_{p_{1}}}^{-k}(\lambda )x_{ij}\tag{10}
\]
\[
=\sum _{j\in S_{F}^{-k}(\lambda )}\hat{\beta }_{j,C_{p_{1}}}^{-k}(\lambda )x_{ij}\tag{11}
\]
\[
=\sum _{j\in S_{F}^{-k}(\lambda )}\hat{\beta }_{j,C_{F}}^{-k}(\lambda )x_{ij} =\hat{ y}_{i,C_{F}}^{-k}(\lambda ),\tag{12}
\]
because the last term in (10) is zero and the two last equalities, in (11) and (12), follow from (8) and (9) respectively. Similarly we have
\[
\hat{y}_{i,C_{p_{2}}}^{-k}(\lambda ) =\hat{ y}_{i,C_{F}}^{-k}(\lambda ),
\]
so that \(\hat{y}_{i,C_{p_{ 1}}}^{-k}(\lambda ) =\hat{ y}_{i,C_{p_{ 2}}}^{-k}(\lambda )\) holds \(\forall i\). Finally this implies \(CV _{C_{p_{ 1}}}(\lambda ) = CV _{C_{p_{ 2}}}(\lambda ) = CV _{C_{F}}(\lambda )\) and hereby (5).
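The conclusion of Proof 2, that the cross-validation curves for a preselected set and for the full set coincide once the fold-wise active sets are contained in the preselected set, can be illustrated as follows. This is a simulation sketch with invented toy data; the \(\lambda\) grid is restricted to its large-\(\lambda\) end, where freezing sets in first, and the preselected set is an arbitrary prefix containing the signals.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# toy data: three signals sitting inside the first 30 covariates
rng = np.random.default_rng(2)
n, P = 100, 60
X = rng.standard_normal((n, P))
beta = np.zeros(P)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.5 * rng.standard_normal(n)

def cv_curve(cols, lambdas, n_folds=5):
    """CV_{C_p}(lambda): cross-validated squared error of the lasso on X[:, cols]."""
    errs = np.zeros(len(lambdas))
    folds = KFold(n_folds, shuffle=True, random_state=0)  # same folds for every C_p
    for train, test in folds.split(X):
        for i, lam in enumerate(lambdas):
            m = Lasso(alpha=lam, fit_intercept=False,
                      tol=1e-10, max_iter=200_000)
            m.fit(X[np.ix_(train, cols)], y[train])
            pred = X[np.ix_(test, cols)] @ m.coef_
            errs[i] += ((y[test] - pred) ** 2).sum()
    return errs / n

lambdas = np.array([0.5, 0.3])                 # large-lambda end of the grid
curve_sub = cv_curve(np.arange(30), lambdas)   # preselected set containing the signals
curve_full = cv_curve(np.arange(P), lambdas)
frozen = bool(np.allclose(curve_sub, curve_full, atol=1e-5))
```

Using the same fold assignment for every candidate set is essential here; otherwise the two curves would differ for reasons unrelated to freezing.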
Appendix 2
We collect here some further arguments which lead to the reordering in Part 2 of our algorithm. Consider two consecutive cross-validation curves for \(C_{p_{m}}\) and \(C_{p_{m+1}}\), and assume that the two curves coincide in an interval \(\tilde{\varLambda }= [\tilde{\lambda },\lambda _{max}]\) which includes a minimum at \(\lambda _{p_{m}}^{{\ast}} >\tilde{\lambda }\). Part 1 of our algorithm would stop with \(p_{m}\) variables and return the solution \(S_{p_{m}}(\lambda _{p_{m}}^{{\ast}})\). By definition of freezing, if \(S_{p_{m+1}}^{-k}(\lambda ) \subseteq C_{p_{m}}\) for all folds \(k\) and for all \(\lambda \in \tilde{\varLambda }\), then the two curves for \(C_{p_{m}}\) and \(C_{p_{m+1}}\) are identical at \(\lambda _{p_{m}}^{{\ast}}\) and at all other values of \(\lambda \in \tilde{\varLambda }\). Nevertheless \(S_{F}(\lambda ^{{\ast}})\) might not be included in \(C_{p_{m}}\), and hence \(S_{p_{m}}(\lambda _{p_{m}}^{{\ast}})\) is not the correct solution for the full data set. If, on the contrary, some variables active in the cross-validation for \(C_{p_{m+1}}\) are not in \(C_{p_{m}}\), that is, \(S_{p_{m+1}}^{-k}(\lambda _{p_{m}}^{{\ast}})\not\subset C_{p_{m}}\) for some fold \(k\), then the two curves do not coincide down to and beyond \(\lambda _{p_{m}}^{{\ast}}\), and hence the algorithm does not erroneously stop. Therefore the sequence of preselected sets should be such that, while waiting for \(S_{F}(\lambda ^{{\ast}})\) to be included in a \(C_{p_{m}}\) (at which point the curves cannot change any more at the minimum), the new active set \(S_{p_{m+1}}(\lambda _{p_{m}}^{{\ast}})\) at the current minimum \(\lambda _{p_{m}}^{{\ast}}\) typically includes some new variables which were not in the previous set \(C_{p_{m}}\). This leads to the idea of sequential reordering once we have found a first "local point of freezing".

Part 2 of our algorithm follows this line and greedily constructs the next set \(C_{p_{m+1}}\) by introducing new variables which have a high chance of being in \(S_{p_{m+1}}(\lambda _{p_{m}}^{{\ast}})\). This is done by reordering the unused variables based on the residuals \(\boldsymbol{r}\), computed using the selected variables \(S_{p_{m}}(\lambda _{p_{m}}^{{\ast}})\).
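A minimal sketch of this residual-based reordering; all inputs (`X`, `y`, the current set `C_pm`, and the current minimum `lam_star`) are invented toy quantities, and the code illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

# toy setting: four signals, current preselected set = first 50 columns
rng = np.random.default_rng(3)
n, P = 100, 500
X = rng.standard_normal((n, P))
beta = np.zeros(P)
beta[:4] = 1.5
y = X @ beta + rng.standard_normal(n)

C_pm = np.arange(50)          # current preselected set C_{p_m}
lam_star = 0.2                # current cross-validated minimum (assumed given)

# fit the lasso on the current set and take residuals from its selection
m = Lasso(alpha=lam_star, fit_intercept=False, max_iter=100_000).fit(X[:, C_pm], y)
r = y - X[:, C_pm] @ m.coef_

# reorder the unused variables by |correlation with the residuals| ...
unused = np.setdiff1d(np.arange(P), C_pm)
Xu = X[:, unused] - X[:, unused].mean(axis=0)
score = np.abs(Xu.T @ (r - r.mean()))

# ... and grow the next set C_{p_{m+1}} with the most promising newcomers
C_pm1 = np.concatenate([C_pm, unused[np.argsort(-score)[:50]]])
```

The scoring step is the same cheap matrix pass as the initial marginal ranking, only applied to the lasso residuals instead of the raw outcome.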
Appendix 3
Further details from the simulation studies are summarized here. First, we consider the linear regression model as described in the main manuscript, while results from experiments using a logistic regression model are reported thereafter.
3.1 Linear Regression Model
The results are reported for Scenarios A, B and D, with data generated as described in Sect. 3 in the main manuscript. We investigate how many variables are needed to avoid the preselection bias.

Comparing Scenarios A, B and D, we see that freezing can be very useful not only in situations with no correlation among the covariates, but also when the covariates are correlated. The results are quite similar, with a small advantage when the covariates are generated independently, possibly because the marginal correlation ranking captures the true nonzero coefficients earlier when there is little correlation among the covariates.

For all three scenarios, we observe that when models with less noise are considered (SNR ≈ 2), there are practically no situations in which the lasso selects fewer than 20 variables. When SNR ≈ 0.5 there are more situations where the cross-validation curves have well-defined minima, leading to a smaller number of selected variables; hence our approach offers a greater advantage in these situations. For example, in the situations where the lasso selects fewer than 80 variables, the largest gain is observed when SNR ≈ 0.5, where the average percentage of data needed to recover the optimal solution is no more than 15 %, 19 % and 15 % for the three scenarios respectively.
Scenario A (Table 4 and Fig. 7)
Scenario B (Table 5 and Fig. 8)
Scenario D (Table 6 and Fig. 9)
3.2 Logistic Regression Model
Finally, we perform one experiment with 100 replications and a binary response. For simplicity, we only consider the covariate matrix generated as in Scenario B, with P = 10,000. Results are summarized in Table 7 and Fig. 10. Here we also observe situations where the cross-validated optimal solution is not well-defined and the lasso selects very many variables. Nevertheless, in 57 out of 100 experiments the curves are frozen down to and below the minimum at \(\lambda ^{{\ast}}\) using less than 50 % of the data. In several cases this happens already with 20–30 % of the data.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Bergersen, L.C., Ahmed, I., Frigessi, A., Glad, I.K., Richardson, S. (2016). Preselection in Lasso-Type Analysis for Ultra-High Dimensional Genomic Exploration. In: Frigessi, A., Bühlmann, P., Glad, I., Langaas, M., Richardson, S., Vannucci, M. (eds) Statistical Analysis for High-Dimensional Data. Abel Symposia, vol 11. Springer, Cham. https://doi.org/10.1007/978-3-319-27099-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27097-5
Online ISBN: 978-3-319-27099-9