Statistical Models to Explore the Exposome: From OMICs Profiling to ‘Mechanome’ Characterization

  • Marc Chadeau-Hyam
  • Roel Vermeulen


Over the past decade, high-resolution molecular profiles generated with OMICs technologies have accumulated and now provide an unprecedented source of information for exploring the biological effects of external stressors and for detecting drivers of subsequent disease risk. Although the volume, dimensionality, and complexity of OMICs data are constantly increasing, several methods enabling their analysis are now available. The exploration of these data relies on statistical approaches including univariate models coupled with multiple testing correction, dimensionality reduction techniques, and variable selection approaches. While these methods are established, their application in an exposome context raises specific methodological challenges. In addition, exploring a single OMIC profile in isolation can capture stressor-induced biological or biochemical alterations that may affect individual risk profiles, but it may yield only a partial picture of the complex molecular events involved, thereby limiting our understanding of the mechanisms that mediate the effect of the exposome. Despite substantial developments in systems biology approaches, the integration of multiple OMICs profiles remains at best data-specific, usually disease-specific, and is generally restricted to the exploration of a few predefined hypotheses. The challenging task of exploring the ‘mechanome’, defined as the ensemble of stressor-induced molecular mechanisms that occur throughout the life course and determine an individual’s risk of developing adverse conditions, can be decomposed into three interdependent streams focusing on (1) OMICs profiling, (2) OMICs data integration, and (3) the exploration of the molecular mechanisms that mediate the effect of exposure on (chronic) disease development.
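
As an illustration of the first family of approaches mentioned above (univariate models coupled with multiple testing correction), the sketch below applies one common correction, the Benjamini-Hochberg step-up procedure for controlling the false discovery rate, to a vector of per-feature p-values. It is a minimal example under stated assumptions, not the authors' pipeline: the function name `benjamini_hochberg`, the significance level `alpha`, and the p-values themselves are hypothetical and chosen only for illustration. In practice the p-values would come from one univariate model per OMIC feature.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected
    by the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(p_values)
    if m == 0:
        return []
    # Rank features by ascending p-value, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, e.g. one per OMIC feature tested against an exposure.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
selected = [i for i, r in enumerate(rejected) if r]
```

Because the rule is step-up, a feature can be retained even when its own p-value exceeds its rank-specific threshold, provided a higher-ranked p-value meets its threshold. Other corrections (for example permutation-based thresholds) could be substituted without changing the overall univariate workflow.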


Keywords: Statistical models, Omics, Mechanome, Bioinformatics



Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  1. MRC/PHE Centre for Environment and Health, Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, London, UK
  2. Institute for Risk Assessment Sciences (IRAS), Utrecht University, Utrecht, The Netherlands
  3. Department of Molecular Epidemiology, Julius Center, University Medical Center Utrecht, Utrecht, The Netherlands
