Abstract
This chapter describes the role of machine learning approaches such as random forests in holistic discovery applications and provides the background needed to understand them. Their suitability for feature selection, data integration, and network modelling is also evaluated through recent examples from the literature. These examples cover a variety of fields, ranging from ecology to metabolomics.
Copyright information
© 2015 Springer-Verlag London
About this chapter
Cite this chapter
Montoliu, I. (2015). Adopting Multivariate Nonparametric Tools to Determine Genotype-Phenotype Interactions in Health and Disease. In: Kochhar, S., Martin, FP. (eds) Metabonomics and Gut Microbiota in Nutrition and Disease. Molecular and Integrative Toxicology. Springer, London. https://doi.org/10.1007/978-1-4471-6539-2_3
DOI: https://doi.org/10.1007/978-1-4471-6539-2_3
Publisher Name: Springer, London
Print ISBN: 978-1-4471-6538-5
Online ISBN: 978-1-4471-6539-2
eBook Packages: Biomedical and Life Sciences (R0)