Variable Selection for High Dimensional Metagenomic Data
We address the high-dimensional variable selection problem of associating microbial compositions with a phenotype such as body mass index or disease status. Because sequencing depth varies across samples, the number of reads assigned to a species or an operational taxonomic unit (OTU) is not directly comparable between samples. The metagenomic count data are therefore usually rarefied or normalized before downstream analysis. In this chapter, we employ a log-contrast model, which bypasses the need for normalization. We propose a new method to identify phenotype-associated species or OTUs using penalized regression and stability selection. The proposed method can also be applied to variable selection in regression analysis with compositional covariates. We compare the performance of different methods through simulation studies and real data analysis in the field of metagenomics.
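To illustrate why a log-contrast model can bypass normalization, the following minimal sketch assumes the standard form of the model, in which the linear predictor is a weighted sum of log abundances and the coefficients sum to zero. Under that constraint, multiplying every count in a sample by the same factor (a deeper sequencing run, or conversion to proportions) leaves the predictor unchanged. The counts, coefficients, and function name are hypothetical and chosen only for illustration; they are not taken from the chapter.

```python
import math

def log_contrast_predictor(counts, beta):
    """Linear predictor sum_j beta_j * log(x_j) for one sample."""
    # The log-contrast model requires the coefficients to sum to zero.
    if abs(sum(beta)) > 1e-12:
        raise ValueError("log-contrast coefficients must sum to zero")
    return sum(b * math.log(x) for b, x in zip(beta, counts))

# Hypothetical read counts for four taxa in one sample (all positive;
# zero counts would need a pseudocount before taking logs).
counts = [120.0, 30.0, 45.0, 5.0]
# Hypothetical coefficients satisfying the zero-sum constraint.
beta = [0.5, -0.5, 0.25, -0.25]

# The same sample sequenced ten times deeper.
deeper = [10.0 * x for x in counts]
# The same sample converted to relative abundances.
total = sum(counts)
proportions = [x / total for x in counts]

eta_counts = log_contrast_predictor(counts, beta)
eta_deeper = log_contrast_predictor(deeper, beta)
eta_props = log_contrast_predictor(proportions, beta)

# All three predictors agree up to floating-point error, because a common
# scale factor c contributes log(c) * sum(beta) = 0.
print(eta_counts, eta_deeper, eta_props)
```

The penalized regression and stability selection steps are not shown here; this sketch covers only the scale-invariance property of the model.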