Adopting Multivariate Nonparametric Tools to Determine Genotype-Phenotype Interactions in Health and Disease

Metabonomics and Gut Microbiota in Nutrition and Disease

Part of the book series: Molecular and Integrative Toxicology (MOLECUL)

Abstract

This chapter describes the role of machine learning approaches, such as random forests, in holistic discovery applications and provides background for understanding them. Their suitability for feature selection, data integration, and network modelling is also evaluated through recent examples from the literature. These examples cover a variety of fields, ranging from ecology to metabolomics.

Author information

Corresponding author

Correspondence to Ivan Montoliu.

Copyright information

© 2015 Springer-Verlag London

About this chapter

Cite this chapter

Montoliu, I. (2015). Adopting Multivariate Nonparametric Tools to Determine Genotype-Phenotype Interactions in Health and Disease. In: Kochhar, S., Martin, FP. (eds) Metabonomics and Gut Microbiota in Nutrition and Disease. Molecular and Integrative Toxicology. Springer, London. https://doi.org/10.1007/978-1-4471-6539-2_3
