Abstract
Background: In recent years, metagenomics, a set of culture-independent research techniques, has become increasingly popular for studying the genomic and genetic variation of microbes in environmental or clinical samples. Although metagenomic sequence data are generated with sequencing technologies similar to those used for RNA-Seq, there is increasing evidence that they should not be treated as just another variant of RNA-Seq count data, especially because of their compositional characteristics. Comparing taxonomic or functional profiles of microbial communities between conditions is often of primary interest, and normalization for library size is usually an unavoidable step before a typical differential abundance analysis. Several methods have been proposed for such normalization, but existing performance evaluations of normalization methods for metagenomic sequence data do not adequately consider these compositional characteristics.
Results: The normalization methods assessed in this chapter include Total Sum Scaling (TSS), Relative Log Expression (RLE), Trimmed Mean of M-values (TMM), Cumulative Sum Scaling (CSS), and Rarefying (RFY). In addition to compositional proportions, the simulated data were generated to account for overdispersion, zero inflation, and under-sampling. The impact of normalization on subsequent differential abundance analysis was also studied.
Conclusion: Selection of a normalization method for metagenomic compositional data should be made on a case-by-case basis. Simulation using the parameters learned from the experimental data may be carried out to assist the selection.
References
Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R106.
Anders, S., et al. (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols, 8(9), 1765–1786.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289–300.
Bragg, L., & Tyson, G. W. (2014). Metagenomics using next-generation sequencing. Environmental Microbiology: Methods and Protocols, 1096, 183–201.
Bullard, J. H., et al. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11(1), 94.
Caporaso, J. G., et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5), 335–336.
Cole, J. R., et al. (2013). Ribosomal Database Project: Data and tools for high throughput rRNA analysis. Nucleic Acids Research, 42(D1), D633–D642.
Costea, P. I., et al. (2014). A fair comparison. Nature Methods, 11(4), 359.
Dillies, M.-A., et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6), 671–683.
Fernandes, A. D., et al. (2014). Unifying the analysis of high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 15.
Gloor, G. B., et al. (2016). It’s all relative: Analyzing microbiome data as compositions. Annals of Epidemiology, 26(5), 322–329.
Handelsman, J. (2004). Metagenomics: Application of genomics to uncultured microorganisms. Microbiology and Molecular Biology Reviews, 68(4), 669–685.
Johnson, S., et al. (2014). A better sequence-read simulator program for metagenomics. BMC Bioinformatics, 15(9), S14.
Mandal, S., et al. (2015). Analysis of composition of microbiomes: A novel method for studying microbial composition. Microbial Ecology in Health and Disease, 26(1), 27663.
McMurdie, P. J., & Holmes, S. (2013). phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE, 8(4), e61217.
McMurdie, P. J., & Holmes, S. (2014). Waste not, want not: Why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4), e1003531.
Metzker, M. L. (2010). Sequencing technologies—The next generation. Nature Reviews Genetics, 11(1), 31–46.
National Research Council. (2007). The new science of metagenomics: Revealing the secrets of our microbial planet. Washington, DC: National Academies Press.
Paulson, J. N., et al. (2013). Differential abundance analysis for microbial marker-gene surveys. Nature Methods, 10(12), 1200–1202.
Paulson, J. N., Bravo, H. C., & Pop, M. (2014). Reply to: “A fair comparison”. Nature Methods, 11(4), 359–360.
Peterson, J., et al. (2009). The NIH human microbiome project. Genome Research, 19(12), 2317–2323.
Powell, S., et al. (2014). eggNOG v4.0: Nested orthology inference across 3686 organisms. Nucleic Acids Research, 42(D1), D231–D239.
Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25.
Shreiner, A. B., Kao, J. Y., & Young, V. B. (2015). The gut microbiome in health and in disease. Current Opinion in Gastroenterology, 31(1), 69.
Sohn, M. B., Du, R., & An, L. (2015). A robust approach for identifying differentially abundant features in metagenomic samples. Bioinformatics, 31(14), 2269–2275.
Srinivas, G., et al. (2013). Genome-wide mapping of gene–microbiota interactions in susceptibility to autoimmune skin blistering. Nature Communications, 4, 2462.
Tatusov, R. L., et al. (2003). The COG database: An updated version includes eukaryotes. BMC Bioinformatics, 4(1), 1.
Tsilimigras, M. C., & Fodor, A. A. (2016). Compositional data analysis of the microbiome: Fundamentals, tools, and challenges. Annals of Epidemiology, 26(5), 330–335.
Turnbaugh, P. J., et al. (2009). The effect of diet on the human gut microbiome: A metagenomic analysis in humanized gnotobiotic mice. Science Translational Medicine, 1(6), 6ra14.
Wang, Q., et al. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267.
Weiss, S., et al. (2017). Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome, 5(1), 27.
White, J. R., Nagarajan, N., & Pop, M. (2009). Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology, 5(4), e1000352.
Woese, C. R. (1987). Bacterial evolution. Microbiological Reviews, 51(2), 221.
Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics. PLoS Computational Biology, 6(2), e1000667.
Yang, Y. H., et al. (2002). Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30(4), e15.
Acknowledgements
The authors are grateful to two anonymous reviewers for their careful reading of the manuscript and their comments and suggestions. ZF’s research is supported by grant U54 GM104940 from the National Institute of General Medical Sciences of the National Institutes of Health, which funds the Louisiana Clinical and Translational Science Center of Pennington Biomedical Research Center. LA’s research is partially supported by National Science Foundation [DMS-1222592] and United States Department of Agriculture [Hatch project, ARZT-1360830-H22-138]. RD’s research was supported in part by the UNM Comprehensive Cancer Center, a recipient of NCI Cancer Support Grant 2 P30 CA118100-11 (PI: Cheryl L. Willman, MD).
Appendix
1.1 Supplementary Data Distribution
Negative Binomial Distribution. An NB distribution is defined as,
where r and p are two parameters, and r is called the size parameter. The mean of the NB distribution is,
and the variance is,
Thus, r indicates the level of overdispersion in the counts.
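As a sketch of the quantities referred to above, assume the common parameterization in which p is the success probability and Y counts the failures observed before the r-th success; the chapter's exact convention may differ. Under that assumption, the probability mass function, mean, and variance can be written as:

\[ P(Y = y) = \frac{\Gamma(y + r)}{\Gamma(r)\, y!}\, p^{r} (1 - p)^{y}, \qquad y = 0, 1, 2, \ldots \]

\[ \mu = E(Y) = \frac{r(1 - p)}{p}, \qquad \mathrm{Var}(Y) = \frac{r(1 - p)}{p^{2}} = \mu + \frac{\mu^{2}}{r} \]

Written this way, the variance exceeds the mean by \( \mu^{2}/r \), so a smaller r corresponds to stronger overdispersion, and the distribution approaches the Poisson as r grows large.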
1.2 Supplementary Illustration of TMM and RLE with Compositional Dataset
For gene expression studies, there is a widely used assumption that the majority of genes are not differentially expressed between conditions. Many RNA-Seq normalization methods, including TMM and RLE, were developed based on this assumption. The “non-differential” part of the assumption is implemented as non-differential absolute abundance after normalization. Subsequent differential analysis likewise compares the normalized counts between conditions, rather than the relative abundances, as would be done for compositional data.
To focus on the essence of the normalization procedure, the hypothetical datasets consist of the expected counts. For the TMM approach, the logarithm and the weighted sum are not applied, since they are designed to reduce the effect of count variation. In Fig. 16.5, the relative abundance ratio, compared to the first sample, is first calculated from the raw counts, i.e., \( \frac{y_{ij}/\sum_i y_{ij}}{y_{i1}/\sum_i y_{i1}} \). The trimmed mean of the ratios for each sample, after trimming the largest and smallest values, is used as the scale factor. The true scale factor is 2.6 (390/150), but the output from TMM is 1.73. Figure 16.5 shows a very likely situation for metagenomic compositional data, in which the relative abundances vary considerably between conditions. TMM may not work well for such data, since it relies entirely on the assumption that after normalization most features share the same absolute abundance.
For RLE normalization, the geometric mean of the counts for each feature across all samples is first calculated (see Fig. 16.6). Next, the ratio of each raw count to the geometric mean for the same feature is computed. The scale factor for a sample is the median of the ratios for that sample. For this hypothetical dataset, the RLE approach does not suggest any normalization adjustment, since all the scale factors equal 1; however, the true library sizes are very different (e.g., 210 vs. 310). In Fig. 16.6, it is clear that the scale factor is determined by the absolute count of Feature 2, 3, 4, or 5, rather than by the relative abundance of any of those features. A subsequent comparative analysis would conclude that there is no differential abundance for Feature 2, 3, 4, or 5 between the conditions. However, the relative abundances of these features have changed: for Feature 4, it is 14% and 10% under the two conditions, respectively.
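The simplified TMM calculation described above (ratios of relative abundances against the first sample, then a mean after dropping the largest and smallest ratio) can be sketched in a few lines of Python. The function name `simplified_tmm_factor` and the counts are hypothetical, chosen only for illustration; they are not the values in Fig. 16.5, and the sketch omits the logarithm and weighting of the full TMM method.

```python
def simplified_tmm_factor(sample, reference):
    """Trimmed mean of relative-abundance ratios (no logs, no weights).

    Hypothetical helper for illustration; not the full TMM procedure.
    """
    n_sample, n_ref = sum(sample), sum(reference)
    # Per-feature ratio (y_ij / N_j) / (y_i1 / N_1), rearranged to a single division.
    # Features with a zero reference count are skipped to avoid division by zero.
    ratios = sorted(
        (y * n_ref) / (y_ref * n_sample)
        for y, y_ref in zip(sample, reference)
        if y_ref > 0
    )
    if len(ratios) < 3:
        raise ValueError("need at least three usable features to trim both ends")
    trimmed = ratios[1:-1]  # drop the single smallest and single largest ratio
    return sum(trimmed) / len(trimmed)


# Hypothetical expected counts for five features (assumed values).
reference = [10, 20, 30, 40, 100]   # library size 200
sample = [10, 20, 30, 40, 300]      # library size 400
factor = simplified_tmm_factor(sample, reference)
print(factor)  # 0.5
```

In this toy case only the last feature changes in absolute count, so the trimmed mean of 0.5 rescales the second library (400 reads) to an effective size of 200, matching the reference. When relative abundances shift for most features, there is no such group of unchanged features for the trimmed mean to lock onto.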
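The RLE steps described above (per-feature geometric mean, per-count ratio, per-sample median) can be sketched as follows. The function name `rle_scale_factors` and the counts are hypothetical and are not the values in Fig. 16.6; skipping features that contain a zero count is an assumption made here so that every geometric mean is positive.

```python
import math
from statistics import median


def rle_scale_factors(counts):
    """counts[i][j] is the count of feature i in sample j.

    Returns one scale factor per sample. Hypothetical helper for illustration.
    """
    # Assumption: use only features observed in every sample.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no feature has nonzero counts in every sample")
    # Geometric mean of each feature across samples, computed on the log scale.
    geo_means = [
        math.exp(sum(math.log(c) for c in row) / len(row)) for row in usable
    ]
    n_samples = len(counts[0])
    # Scale factor of sample j: median over features of count / geometric mean.
    return [
        median(row[j] / g for row, g in zip(usable, geo_means))
        for j in range(n_samples)
    ]


# Hypothetical expected counts: only Feature 1 differs between the two samples.
counts = [
    [10, 110],
    [50, 50],
    [50, 50],
    [30, 30],
    [60, 60],
]
factors = rle_scale_factors(counts)
library_sizes = [sum(row[j] for row in counts) for j in range(2)]
feature4_share = [counts[3][j] / library_sizes[j] for j in range(2)]
print([round(f, 6) for f in factors], library_sizes, feature4_share)
```

With these assumed counts, both scale factors are 1 (up to floating-point rounding) even though the library sizes are 200 and 300, while the relative abundance of Feature 4 drops from 15% to 10%. This mirrors the behaviour the text describes: the median ratio is set by features whose absolute counts agree, regardless of their relative abundances.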
1.3 Supplementary Example
Mouse stool metagenomic data. Fresh or frozen adult human fecal microbial communities were transplanted into the guts of germ-free C57BL/6J mice. Here, germ-free refers to mouse guts that have not previously been exposed to microbes. Following the transplantation, 12 recipient mice were fed a standard low-fat, plant polysaccharide-rich diet for 4 weeks; after that, six mice were switched to a high-fat/high-sugar Western diet for another 6 weeks. Amplification and pyrosequencing of the V2 region of 16S rRNA genes were performed periodically to record changes in the microbial community structure of the mouse fecal samples (Turnbaugh et al. 2009). There are 85 samples under condition one (from mice fed the low-fat diet) and 54 samples under condition two (from mice fed the Western diet). The bioinformatic tool RDP (Wang et al. 2007) was used to generate the count data, which is featured at species level. Together, there are 52 genera shown under both conditions, and the data are considered to represent low-complexity metagenomic data. Figure 16.7 demonstrates that RLE and RFY should not be recommended for normalization of the metagenomic data.
Copyright information
© 2018 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Du, R., An, L., Fang, Z. (2018). Performance Evaluation of Normalization Approaches for Metagenomic Compositional Data on Differential Abundance Analysis. In: Zhao, Y., Chen, DG. (eds) New Frontiers of Biostatistics and Bioinformatics. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-99389-8_16
DOI: https://doi.org/10.1007/978-3-319-99389-8_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99388-1
Online ISBN: 978-3-319-99389-8
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)