Abstract
Enrichment of predictive models with new biomolecular markers is an important task in high-dimensional omic applications. Increasingly, clinical studies include several sets of such omics markers available for each patient, measuring different levels of biological variation. As a result, one of the main challenges in predictive research is the integration of different sources of omic biomarkers for the prediction of health traits. We review several approaches for the combination of omic markers in the context of binary outcome prediction, all based on double cross-validation and regularized regression models. We evaluate their performance in terms of calibration and discrimination and we compare their performance with respect to single-omic source predictions. We illustrate the methods through the analysis of two real datasets. On the one hand, we consider the combination of two fractions of proteomic mass spectrometry for the calibration of a diagnostic rule for the detection of early stage breast cancer. On the other hand, we consider transcriptomics and metabolomics as predictors of obesity using data from the Dietary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome (DILGOM) study, a population-based cohort, from Finland.
Keywords
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsNotes
- 1.
The authors “Mar Rodríguez-Girondo” and “Alexia Kakourou” contributed equally to this work.
References
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Bühlmann, P., & Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22, 477–505.
Cox, D. R. (1958). Two further applications of a model for binary regression. Biometrika, 45, 562–565.
de Noo, M. E., Deelder, A. M., Mertens, B. J. A., Ozalp, A., Bladergroen, M. R., van der Werff, M. P. J., & Tollenaar, R. A. E. M. (2005). Detection of colorectal cancer using MALDI-TOF serum protein profiling. European Journal of Cancer, 42, 1068–1076.
Hand, D. J. (1997). Construction and assessment of classification rules. Chichester: Wiley.
Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science, 21, 1–18.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). Elements of statistical learning: Data mining, inference, and prediction. Springer series in statistic. New York: Springer
Hoerl, A. E., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Inouye, M., Kettunen, J., Soininen, P., Silander, K., Ripatti, S., et al. (2010). Metabonomic, transcriptomic, and genomic variation of a population cohort. Molecular Systems Biology, 6, 441.
Inouye, M., Silander, K., Hamalainen, E., Salomaa, V., Harald, K., Jousilahti, P., et al. (2010). An immune response network associated with blood lipid levels. Plos Genetics, 6, e1001113. doi:10.1371/journal.pgen.1001113.
Jonathan, P., Krzanowski, W. J., & McCarthy, M. V. (2000). On the use of cross-validation to assess performance in multivariate prediction. Statistics and Computing, 10, 209–229.
Kakourou, A., Vach, W., & Mertens B. (2014). Combination approaches improve predictive performance of diagnostic rules for mass-spectrometry proteomic data. Journal of Computational Biology, 21, 898–914.
Kneib, T., Hothorn, T., & Tutz, G. (2009). Variable selection and model choice in geoadditive regression models. Biometrics, 65, 626–634.
Leblanc, M., & Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91, 1641–1650.
Le Cessie, S., & van Houwelingen, J. C. (1992). Ridge estimators in logistic regression. Applied Statistics, 41, 191–201.
Liu, H., DÁndrade, P., Fulmer-Smentek, S., Lorenzi, P., Kohn, K. W., Weinstein, J. N., Pommier, Y., & Reinhold, W. C. (2010). mRNA and microRNA expression profiles of the NCI-60 integrated with drug activities. Molecular Cancer Therapeutics, 9, 1080–1091.
Meier, L., van de Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society B, 70, 53–71.
Mertens, B. J. A. (2003). Microarrays, pattern recognition and exploratory data analysis. Statistics in Medicine, 22, 1879–1899
Mertens, B. J. A., de Noo, M. E., Tollenaar, R. A. E. M., & Deelder, A. M. (2006). Mass spectrometry proteomic diagnosis: Enacting the double cross validatory paradigm. Journal of Computational Biology, 13, 1591–1605.
Mertens, B. J. A., van der Burgt, Y. E. M., Velstra, B., Mesker, W. E., Tollenaar, R. A. E. M., & Deelder, A. M. (2011). On the use of double cross-validation for the combination of proteomic mass spectral data for enhanced diagnosis and prediction. Statistics and Probability Letters, 81, 759–766.
Pencina, M. J., D’Agostino, R. B. Sr., D’Agostino, R. B. Jr., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine, 27, 157–172.
Pepe, M. S., Kerr, K. F., Longton, G., & Wang, Z. (2013). Testing for improvement in prediction model performance. Statistics in Medicine, 32, 1467–1482.
Rodríguez-Girondo, M., Salo, P., Burzykowski, T., Perola, M., Houwing-Duistermaat, J. J., & Mertens, B. (2016) Sequential double cross-validation for augmented prediction assessment in high-dimensional omic applications. Working Paper in ArXiv. arXiv:1601.08197v1.
Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J., & Kattan, M. W. (2010). Assessing the performance of prediction models: A framework for some traditional and novel measures. Epidemiology, 21, 128–138.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society. Series B, 36, 111–147.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58, 267–288.
Tutz, G., & Binder, H. (2006). Generalized additive modeling with implicit variable selection by likelihood-based boosting. Biometrics, 62, 961–971.
van de Wiel, M. A., Lien, T. G., Verlaat, W., van Wieringen, W. N., & Wilting, S. M. (2015). Better prediction by use of co-data: Adaptive group-regularized ridge regression Statistics in Medicine, 35, 368–381.
van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 222.
Vickers, A. J., & Elkin, E. B. (2006). Decision curve analysis: A novel method for evaluating prediction models. Medical Decision Making, 26, 565–574.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B, 67, 301–320.
Acknowledgements
The work is supported by funding from the European Community’s Seventh Framework Programme FP7/2011: Marie Curie Initial Training Network MEDIASRES with the Grant Agreement Number 290025 and by funding from the European Union’s Seventh Framework Programme FP7/Health/F5/2012: MIMOmics under the Grant Agreement Number 305280. We thank Sigrid Jusélius Foundation and Yrjö Jahnsson Foundation for providing the DILGOM expression data and Yrjö Jahnsson Foundation for providing DILGOM NMR metabolomics.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Rodríguez-Girondo, M. et al. (2017). On the Combination of Omics Data for Prediction of Binary Outcomes. In: Datta, S., Mertens, B. (eds) Statistical Analysis of Proteomics, Metabolomics, and Lipidomics Data Using Mass Spectrometry. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-45809-0_14
Download citation
DOI: https://doi.org/10.1007/978-3-319-45809-0_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45807-6
Online ISBN: 978-3-319-45809-0
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)