Performance Evaluation of Normalization Approaches for Metagenomic Compositional Data on Differential Abundance Analysis

Du, Ruofei; An, Lingling; Fang, Zhide

doi:10.1007/978-3-319-99389-8_16

Part of the book series: ICSA Book Series in Statistics (ICSABSS)

Abstract

Background: In recent years, metagenomics, a collection of research techniques that do not require cultivation, has become increasingly popular for studying the genomic and genetic variation of microbes in environmental or clinical samples. Although metagenomic sequence data are generated by sequencing technologies similar to those used for RNA-Seq, there is increasing evidence that they should not be treated as just another variant of RNA-Seq count data, particularly because of their compositional characteristics. It is often of primary interest to compare taxonomic or functional profiles of microbial communities between conditions, and normalization for library size is usually a necessary step before a typical differential abundance analysis. Several methods have been proposed for such normalization, but existing performance evaluations of normalization methods for metagenomic sequence data do not adequately consider the compositional characteristics.

Result: The normalization methods assessed in this chapter are Total Sum Scaling (TSS), Relative Log Expression (RLE), Trimmed Mean of M-values (TMM), Cumulative Sum Scaling (CSS), and Rarefying (RFY). In addition to compositional proportions, the simulated data were generated to reflect overdispersion, zero inflation, and under-sampling. The impact of normalization on subsequent differential abundance analysis was also studied.

Conclusion: Selection of a normalization method for metagenomic compositional data should be made on a case-by-case basis. Simulation using the parameters learned from the experimental data may be carried out to assist the selection.

References

Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R106.
Anders, S., et al. (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols, 8(9), 1765–1786.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.
Bragg, L., & Tyson, G. W. (2014). Metagenomics using next-generation sequencing. Environmental Microbiology: Methods and Protocols, 1096, 183–201.
Bullard, J. H., et al. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11(1), 94.
Caporaso, J. G., et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5), 335–336.
Cole, J. R., et al. (2013). Ribosomal Database Project: Data and tools for high throughput rRNA analysis. Nucleic Acids Research, 42(D1), D633–D642.
Costea, P. I., et al. (2014). A fair comparison. Nature Methods, 11(4), 359.
Dillies, M.-A., et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6), 671–683.
Fernandes, A. D., et al. (2014). Unifying the analysis of high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 15.
Gloor, G. B., et al. (2016). It’s all relative: Analyzing microbiome data as compositions. Annals of Epidemiology, 26(5), 322–329.
Handelsman, J. (2004). Metagenomics: Application of genomics to uncultured microorganisms. Microbiology and Molecular Biology Reviews, 68(4), 669–685.
Johnson, S., et al. (2014). A better sequence-read simulator program for metagenomics. BMC Bioinformatics, 15(9), S14.
Mandal, S., et al. (2015). Analysis of composition of microbiomes: A novel method for studying microbial composition. Microbial Ecology in Health and Disease, 26(1), 27663.
McMurdie, P. J., & Holmes, S. (2013). phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE, 8(4), e61217.
McMurdie, P. J., & Holmes, S. (2014). Waste not, want not: Why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4), e1003531.
Metzker, M. L. (2010). Sequencing technologies—The next generation. Nature Reviews Genetics, 11(1), 31–46.
National Research Council. (2007). The new science of metagenomics: Revealing the secrets of our microbial planet. Washington, DC: National Academies Press.
Paulson, J. N., et al. (2013). Differential abundance analysis for microbial marker-gene surveys. Nature Methods, 10(12), 1200–1202.
Paulson, J. N., Bravo, H. C., & Pop, M. (2014). Reply to: “A fair comparison”. Nature Methods, 11(4), 359–360.
Peterson, J., et al. (2009). The NIH human microbiome project. Genome Research, 19(12), 2317–2323.
Powell, S., et al. (2014). eggNOG v4.0: Nested orthology inference across 3686 organisms. Nucleic Acids Research, 42(D1), D231–D239.
Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25.
Shreiner, A. B., Kao, J. Y., & Young, V. B. (2015). The gut microbiome in health and in disease. Current Opinion in Gastroenterology, 31(1), 69.
Sohn, M. B., Du, R., & An, L. (2015). A robust approach for identifying differentially abundant features in metagenomic samples. Bioinformatics, 31(14), 2269–2275.
Srinivas, G., et al. (2013). Genome-wide mapping of gene–microbiota interactions in susceptibility to autoimmune skin blistering. Nature Communications, 4, 2462.
Tatusov, R. L., et al. (2003). The COG database: An updated version includes eukaryotes. BMC Bioinformatics, 4(1), 1.
Tsilimigras, M. C., & Fodor, A. A. (2016). Compositional data analysis of the microbiome: Fundamentals, tools, and challenges. Annals of Epidemiology, 26(5), 330–335.
Turnbaugh, P. J., et al. (2009). The effect of diet on the human gut microbiome: A metagenomic analysis in humanized gnotobiotic mice. Science Translational Medicine, 1(6), 6ra14.
Wang, Q., et al. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267.
Weiss, S., et al. (2017). Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome, 5(1), 27.
White, J. R., Nagarajan, N., & Pop, M. (2009). Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology, 5(4), e1000352.
Woese, C. R. (1987). Bacterial evolution. Microbiological Reviews, 51(2), 221.
Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics. PLoS Computational Biology, 6(2), e1000667.
Yang, Y. H., et al. (2002). Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30(4), e15.

Acknowledgements

The authors are grateful to two anonymous reviewers for their careful reading of the manuscript and their comments and suggestions. ZF’s research is supported by grant U54 GM104940 from the National Institute of General Medical Sciences of the National Institutes of Health, which funds the Louisiana Clinical and Translational Science Center of Pennington Biomedical Research Center. LA’s research is partially supported by National Science Foundation [DMS-1222592] and United States Department of Agriculture [Hatch project, ARZT-1360830-H22-138]. RD’s research was supported in part by the UNM Comprehensive Cancer Center, a recipient of NCI Cancer Support Grant 2 P30 CA118100-11 (PI: Cheryl L. Willman, MD).

Author information

Authors and Affiliations

Biostatistics Shared Resource, University of New Mexico Comprehensive Cancer Center, Albuquerque, NM, USA
Ruofei Du
Department of Agricultural and Biosystems Engineering, University of Arizona, Tucson, AZ, USA
Lingling An
Interdisciplinary Program in Statistics, University of Arizona, Tucson, AZ, USA
Lingling An
Biostatistics Program, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, LA, USA
Zhide Fang

Corresponding author

Correspondence to Zhide Fang.

Editor information

Editors and Affiliations

Department of Mathematics and Statistics, Georgia State University, Atlanta, GA, USA
Yichuan Zhao
Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Ding-Geng Chen

Appendix

1.1 Supplementary Data Distribution

Negative Binomial Distribution. The negative binomial (NB) distribution is defined as

$$ P\left(X=x\right)=\frac{\Gamma \left(x+r\right)}{\Gamma \left(r\right)\,x!}\,{p}^r{\left(1-p\right)}^x, $$

where r and p are the two parameters, and r is called the size parameter. The mean of the NB distribution is

$$ \mu =\frac{r\left(1-p\right)}{p}, $$

and the variance is

$$ V=\frac{r\left(1-p\right)}{p^2}=\mu +\frac{1}{r}{\mu}^2. $$

Thus, r controls the level of overdispersion in the counts: for a fixed mean, the smaller r is, the more the variance exceeds the mean.
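The mean and variance formulas above can be checked numerically. The following is a minimal sketch, not part of the original chapter: the function names `nb_pmf` and `nb_moments`, the truncation point `x_max`, and the parameter values r = 2 and p = 0.2 are illustrative assumptions.

```python
import math

def nb_pmf(x, r, p):
    """P(X = x) for the NB distribution with size r and probability p."""
    log_pmf = (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
               + r * math.log(p) + x * math.log(1.0 - p))
    return math.exp(log_pmf)

def nb_moments(r, p, x_max=2000):
    """Mean and variance obtained by summing the pmf over 0..x_max.

    x_max is an assumed truncation point; it must be large enough that
    the remaining tail probability is negligible for the chosen r and p.
    """
    probs = [nb_pmf(x, r, p) for x in range(x_max + 1)]
    mean = sum(x * q for x, q in enumerate(probs))
    var = sum((x - mean) ** 2 * q for x, q in enumerate(probs))
    return mean, var

# Illustrative parameter values only (not taken from the chapter).
r, p = 2.0, 0.2
mean, var = nb_moments(r, p)

mu = r * (1 - p) / p        # closed-form mean
v = mu + mu ** 2 / r        # closed-form variance, mu + mu^2 / r

print(f"mean: numeric {mean:.6f}, formula {mu:.6f}")
print(f"variance: numeric {var:.6f}, formula {v:.6f}")
```

With these values the closed-form mean is 8 and the closed-form variance is 40, five times the mean, which is the overdispersion that the size parameter r governs.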

1.2 Supplementary Illustration of TMM and RLE with Compositional Dataset

For gene expression studies, it is widely assumed that the majority of genes are not differentially expressed between conditions. Many RNA-Seq normalization methods, including TMM and RLE, were developed based on this assumption. In these methods, "non-differential" is implemented as non-differential absolute abundance after normalization. The subsequent differential analysis likewise compares the normalized counts between conditions, rather than the relative abundances as is appropriate for compositional data.

To focus on the essence of each normalization procedure, the hypothetical datasets consist of the expected counts. For the TMM approach, the logarithm and the weighted sum are not applied, since they are designed to reduce the effect of count variation. In Fig. 16.5, the ratio of relative abundances with respect to the first sample is first calculated from the raw counts, i.e., $ \frac{y_{ij}/{\sum}_k{y}_{kj}}{y_{i1}/{\sum}_k{y}_{k1}} $, where $ y_{ij} $ denotes the count of feature i in sample j. The trimmed mean of the ratios for each sample, after trimming the largest and smallest values, is used as the scale factor. The true scale factor is 2.6 (390/150), but the output from TMM is 1.73. Figure 16.5 shows a situation that is very likely for metagenomic compositional data, in which the relative abundances vary widely between conditions. TMM may not work well for such data because it relies solely on the assumption that, after normalization, most features should share the same absolute abundance.
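The simplified TMM calculation described above (no logarithm, no weights) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code or the edgeR implementation: the function name `simplified_tmm_factor` and the counts are hypothetical and are not the values shown in Fig. 16.5.

```python
def simplified_tmm_factor(sample, reference):
    """Trimmed mean of relative-abundance ratios against a reference sample.

    Follows the simplified description in the text: take the ratio of
    relative abundances feature by feature, drop the single largest and
    single smallest ratio, and average the rest. Counts must be positive.
    """
    if len(sample) != len(reference) or len(sample) < 3:
        raise ValueError("need two samples with the same >= 3 features")
    n_s, n_r = sum(sample), sum(reference)
    ratios = sorted((y / n_s) / (y_ref / n_r)
                    for y, y_ref in zip(sample, reference))
    trimmed = ratios[1:-1]          # drop the smallest and the largest
    return sum(trimmed) / len(trimmed)

# Hypothetical expected counts (NOT the data of Fig. 16.5).
reference = [10, 20, 30, 40, 100]   # library size 200
sample = [20, 40, 60, 80, 100]      # library size 300

factor = simplified_tmm_factor(sample, reference)
self_factor = simplified_tmm_factor(reference, reference)
print(f"scale factor vs. reference: {factor:.4f}")
print(f"reference vs. itself: {self_factor:.4f}")
```

In this made-up example four of the five features double in absolute count while the fifth is unchanged, so the trimmed mean is driven entirely by the four features that share one ratio. The sketch only shows the mechanics of the calculation; it makes no claim to reproduce the 1.73 reported for Fig. 16.5.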

For RLE normalization, the geometric mean of the counts of each feature across all samples is first calculated (see Fig. 16.6). Next, the ratio of each raw count to the geometric mean for the same feature is computed. The scale factor for a sample is the median of that sample's ratios. For this hypothetical dataset, the RLE approach does not suggest any normalization adjustment, since all the scale factors are equal to 1; however, the true library sizes are very different (e.g., 210 vs. 310). In Fig. 16.6, it is clear that the scale factor is determined by the absolute count of Feature 2, 3, 4, or 5 rather than by the relative abundance of any of those features. A subsequent comparative analysis would conclude that there is no differential abundance for Feature 2, 3, 4, or 5 between the conditions. However, the relative abundances of these features have changed; for Feature 4, they are 14% and 10% under the two conditions, respectively.
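The three RLE steps can be sketched in a few lines. This is a minimal illustration, not the DESeq implementation: the function name `rle_scale_factors` is hypothetical, and the two-sample dataset is invented. Only the library sizes (210 and 310) and the unchanged counts of Features 2 to 5 mirror the description above; the individual counts are assumptions, not the values of Fig. 16.6.

```python
import math
from statistics import median

def rle_scale_factors(counts):
    """RLE scale factor per sample; counts[j][i] is feature i in sample j.

    Step 1: geometric mean of each feature across samples.
    Step 2: ratio of each count to its feature's geometric mean.
    Step 3: median of the ratios within each sample.
    All counts are assumed to be strictly positive.
    """
    n_features = len(counts[0])
    geo_means = [
        math.exp(sum(math.log(s[i]) for s in counts) / len(counts))
        for i in range(n_features)
    ]
    return [median(s[i] / geo_means[i] for i in range(n_features))
            for s in counts]

# Hypothetical expected counts (NOT the data of Fig. 16.6):
# only Feature 1 changes in absolute count between the two conditions.
counts = [
    [90, 20, 40, 30, 30],    # condition 1, library size 210
    [190, 20, 40, 30, 30],   # condition 2, library size 310
]

factors = rle_scale_factors(counts)
rel_feature4 = [s[3] / sum(s) for s in counts]

print("RLE scale factors:", [round(f, 4) for f in factors])
print("Feature 4 relative abundance:", [round(x, 3) for x in rel_feature4])
```

Because four of the five ratios in each sample equal 1, the median is 1 for both samples, so RLE applies no adjustment even though the relative abundance of Feature 4 falls from roughly 14% to roughly 10%.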

1.3 Supplementary Example

Mouse stool metagenomic data. Fresh or frozen adult human fecal microbial communities were transplanted into the guts of germ-free C57BL/6J mice. Here, a germ-free environment refers to a mouse gut that has not previously been exposed to microbes. Following the transplantation, 12 recipient mice were fed a standard low-fat, plant polysaccharide-rich diet for 4 weeks; after that, six mice were switched to a high-fat/high-sugar Western diet for another 6 weeks. Amplification and pyrosequencing of the V2 region of the 16S rRNA genes were performed periodically to record changes in the microbial community structure of the mice's fecal samples (Turnbaugh et al. 2009). There are 85 samples under condition one (mice fed the low-fat diet) and 54 samples under condition two (mice fed the Western diet). The bioinformatic tool RDP (Wang et al. 2007) was used to generate the count data, which is featured at species level. Together, there are 52 genera shown under both conditions, and the data are considered to represent low-complexity metagenomic data. Figure 16.7 demonstrates that RLE and RFY should not be recommended for normalization of these metagenomic data.

About this chapter

Cite this chapter

Du, R., An, L., Fang, Z. (2018). Performance Evaluation of Normalization Approaches for Metagenomic Compositional Data on Differential Abundance Analysis. In: Zhao, Y., Chen, DG. (eds) New Frontiers of Biostatistics and Bioinformatics. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-99389-8_16

DOI: https://doi.org/10.1007/978-3-319-99389-8_16
Published: 06 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99388-1
Online ISBN: 978-3-319-99389-8
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)
