Skip to main content

Performance Evaluation of Normalization Approaches for Metagenomic Compositional Data on Differential Abundance Analysis

  • Chapter
  • First Online:
New Frontiers of Biostatistics and Bioinformatics

Part of the book series: ICSA Book Series in Statistics ((ICSABSS))

Abstract

Background: In recent years, metagenomics, as a combination of research techniques without the process of cultivation, has become more and more popular in studying the genomic/genetic variation of microbes in environmental or clinical samples. Though generated from similar sequencing technologies, there is increasing evidence that metagenomic sequence data may not be treated as another variant of RNA-Seq count data, especially due to its compositional characteristics. While it is often of primary interest to compare taxonomic or functional profiles of microbial communities between conditions, normalization for library size is usually an inevitable step prior to a typical differential abundance analysis. Some methods have been proposed for such normalization. But the existing performance evaluation of normalization methods for metagenomic sequence data does not adequately consider the compositional characteristics.

Result: The normalization methods assessed in this chapter include Total Sum Scaling (TSS), Relative Log Expression (RLE), Trimmed Mean of M-value (TMM), Cumulative Sum Scaling (CSS), and Rarefying (RFY). In addition to compositional proportions, simulated data were generated with consideration of overdispersion, zero inflation, and under-sampling issue. The impact of normalization on subsequent differential abundance analysis was further studied.

Conclusion: Selection of a normalization method for metagenomic compositional data should be made on a case-by-case basis. Simulation using the parameters learned from the experimental data may be carried out to assist the selection.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 139.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Hardcover Book
USD 179.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  • Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R106.

    Article  Google Scholar 

  • Anders, S., et al. (2013). Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols, 8(9), 1765–1786.

    Article  Google Scholar 

  • Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.

    Google Scholar 

  • Bragg, L., & Tyson, G. W. (2014). Metagenomics using next-generation sequencing. Environmental Microbiology: Methods and Protocols, 1096, 183–201.

    Article  Google Scholar 

  • Bullard, J. H., et al. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11(1), 94.

    Article  Google Scholar 

  • Caporaso, J. G., et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5), 335–336.

    Article  Google Scholar 

  • Cole, J. R., et al. (2013). Ribosomal Database Project: Data and tools for high throughput rRNA analysis. Nucleic Acids Research, 42(D1), D633–D642.

    Article  Google Scholar 

  • Costea, P. I., et al. (2014). A fair comparison. Nature Methods, 11(4), 359.

    Article  Google Scholar 

  • Dillies, M.-A., et al. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Briefings in Bioinformatics, 14(6), 671–683.

    Article  Google Scholar 

  • Fernandes, A. D., et al. (2014). Unifying the analysis of high-throughput sequencing datasets: Characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), 15.

    Article  MathSciNet  Google Scholar 

  • Gloor, G. B., et al. (2016). It’s all relative: Analyzing microbiome data as compositions. Annals of Epidemiology, 26(5), 322–329.

    Article  Google Scholar 

  • Handelsman, J. (2004). Metagenomics: Application of genomics to uncultured microorganisms. Microbiology and Molecular Biology Reviews, 68(4), 669–685.

    Article  Google Scholar 

  • Johnson, S., et al. (2014). A better sequence-read simulator program for metagenomics. BMC Bioinformatics, 15(9), S14.

    Article  Google Scholar 

  • Mandal, S., et al. (2015). Analysis of composition of microbiomes: A novel method for studying microbial composition. Microbial Ecology in Health and Disease, 26(1), 27663.

    Google Scholar 

  • McMurdie, P. J., & Holmes, S. (2013). phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PloS One, 8(4), e61217.

    Article  Google Scholar 

  • McMurdie, P. J., & Holmes, S. (2014). Waste not, want not: Why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4), e1003531.

    Article  Google Scholar 

  • Metzker, M. L. (2010). Sequencing technologies—The next generation. Nature Reviews Genetics, 11(1), 31–46.

    Article  Google Scholar 

  • National Research Council. (2007). The new science of metagenomics: Revealing the secrets of our microbial planet. Washington, DC: National Academies Press.

    Google Scholar 

  • Paulson, J. N., et al. (2013). Differential abundance analysis for microbial marker-gene surveys. Nature Methods, 10(12), 1200–1202.

    Article  Google Scholar 

  • Paulson, J. N., Bravo, H. C., & Pop, M. (2014). Reply to: “A fair comparison”. Nature methods, 11(4), 359–360.

    Article  Google Scholar 

  • Peterson, J., et al. (2009). The NIH human microbiome project. Genome Research, 19(12), 2317–2323.

    Article  Google Scholar 

  • Powell, S., et al. (2014). eggNOG v4. 0: Nested orthology inference across 3686 organisms. Nucleic Acids Research, 42(D1), D231–D239.

    Article  Google Scholar 

  • Robinson, M. D., & Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25.

    Article  Google Scholar 

  • Shreiner, A. B., Kao, J. Y., & Young, V. B. (2015). The gut microbiome in health and in disease. Current Opinion in Gastroenterology, 31(1), 69.

    Article  Google Scholar 

  • Sohn, M. B., Du, R., & An, L. (2015). A robust approach for identifying differentially abundant features in metagenomic samples. Bioinformatics, 31(14), 2269–2275.

    Article  Google Scholar 

  • Srinivas, G., et al. (2013). Genome-wide mapping of gene–microbiota interactions in susceptibility to autoimmune skin blistering. Nature Communications, 4, 2462.

    Article  Google Scholar 

  • Tatusov, R. L., et al. (2003). The COG database: An updated version includes eukaryotes. BMC Bioinformatics, 4(1), 1.

    Article  Google Scholar 

  • Tsilimigras, M. C., & Fodor, A. A. (2016). Compositional data analysis of the microbiome: Fundamentals, tools, and challenges. Annals of Epidemiology, 26(5), 330–335.

    Article  Google Scholar 

  • Turnbaugh, P. J., et al. (2009). The effect of diet on the human gut microbiome: A metagenomic analysis in humanized gnotobiotic mice. Science Translational Medicine, 1(6), 6ra14.

    Article  Google Scholar 

  • Wang, Q., et al. (2007). Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267.

    Article  Google Scholar 

  • Weiss, S., et al. (2017). Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome, 5(1), 27.

    Article  MathSciNet  Google Scholar 

  • White, J. R., Nagarajan, N., & Pop, M. (2009). Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology, 5(4), e1000352.

    Article  Google Scholar 

  • Woese, C. R. (1987). Bacterial evolution. Microbiological Reviews, 51(2), 221.

    Google Scholar 

  • Wooley, J. C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics. PLoS Computational Biology, 6(2), e1000667.

    Article  Google Scholar 

  • Yang, Y. H., et al. (2002). Normalization for cDNA microarray data: A robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30(4), e15.

    Article  Google Scholar 

Download references

Acknowledgements

The authors are grateful to two anonymous reviewers for their careful reading of the manuscript and their comments and suggestions. ZF’s research is supported by grant U54 GM104940 from the National Institute of General Medical Sciences of the National Institutes of Health, which funds the Louisiana Clinical and Translational Science Center of Pennington Biomedical Research Center. LA’s research is partially supported by National Science Foundation [DMS-1222592] and United States Department of Agriculture [Hatch project, ARZT-1360830-H22-138]. RD’s research was supported in part by the UNM Comprehensive Cancer Center, a recipient of NCI Cancer Support Grant 2 P30 CA118100-11 (PI: Cheryl L. Willman, MD).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Zhide Fang .

Editor information

Editors and Affiliations

Appendix

Appendix

1.1 Supplementary Data Distribution

Negative Binomial Distribution. A NB distribution is defined as,

$$ \vspace*{-3pt}P\left(X=x\right)=\frac{\Gamma \left(x+r\right)}{\Gamma \left(\mathrm{r}\right)\mathrm{x}!}{p}^r{\left(1-p\right)}^x,\vspace*{-3pt} $$

where r and p are two parameters, and r is called size parameter. The mean of the NB distribution is,

$$ \vspace*{-3pt}\mu =\frac{r\left(1-p\right)}{p},\vspace*{-3pt} $$

and the variance is,

$$ \vspace*{-3pt}V=\frac{r\left(1-p\right)}{p^2}=\mu +\frac{1}{r}{\mu}^2.\vspace*{-3pt} $$

Thus, r indicates the level of overdispersion in the counts.

1.2 Supplementary Illustration of TMM and RLE with Compositional Dataset

For gene expression studies, there is a widely used assumption that the majority of genes do not express differentially between conditions. Many of RNA-Seq normalization methods were developed based on this assumption, including TMM and RLE. The “non-differential” in the assumption is implemented as non-differential absolute abundance after normalization. Subsequent differential analysis is also to compare the normalized counts between conditions, instead of comparing the relative abundances as it is for compositional data.

Focusing on the essence of a normalization procedure, the hypothetical datasets are made of the expectations of counts. For TMM approach, the logarithm function and the weighted sum are not applied since those are designed for reducing the effect of count variation. In Fig. 16.5, the relative abundance ratio, compared to the first sample, is first calculated from the raw counts, i.e., \( \frac{y_{ij}/{\sum}_i{y}_{ij}}{y_{i1}/{\sum}_i{y}_{i1}} \). The trimmed mean of the ratios for each sample, after trimming the largest and smallest values, is used as the scale factor. The true scale factor is 2.6 (390/150), but the output from TMM is 1.73. Figure 16.5 shows a very likely situation for metagenomic compositional data, in which the relative abundances vary largely between conditions. TMM may not work well for such data since it merely relies on the assumption that after normalization most of features should share the same absolute abundance.

Fig. 16.5
figure 5

TMM normalization on a hypothetical dataset

For RLE normalization, the geometric mean of the counts to each feature from all the samples is first calculated, see Fig. 16.6. Next, the ratio of a raw count over the mean count for the same feature is computed. The scale factor for a sample is obtained as the median of the ratios for the sample. For this hypothetic dataset, RLE approach does not suggest any normalization adjustment since all the scale factors equal to 1; however, the true library sizes are very different (e.g., 210 vs. 310). In Fig. 16.6, it is clear that the scale factor is determined by the absolute count of Feature 2, 3, 4, or 5, instead of the relative abundance of one of those features. A subsequently comparative analysis would reach the conclusion that there is no differential abundance for Feature 2, 3, 4, or 5 between the conditions. However, the relative abundances of the features have altered, for Feature 4 it is 14% and 10% under the two conditions, respectively.

Fig. 16.6
figure 6

RLE normalization steps on a hypothetical dataset

1.3 Supplementary Example

Mouse stool metagenomic data. Fresh or frozen adult human fecal microbial communities were transplanted into guts of germ-free C57BL/6J mice. Here, germ-free environment is referred to as mice gut that does not previously expose to microbes. Following the transplanting, 12 recipient mice were fed with a standard low-fat, plant polysaccharide-rich diet for 4 weeks; after that, six mice were switched to take high-fat/high-sugar Western diet for another 6 weeks. Amplification and pyrosequencing of V2 region of 16S rRNA genes were performed periodically to record the changes of microbial community structure of fecal samples of the mice (Turnbaugh et al. 2009). There are 85 samples under condition one (associated to low-fat diet fed mice), and 54 samples under condition two (associated to Western diet fed mice). The bioinformatic tool RDP (Wang et al. 2007) was used to generate the count data, which is featured at species level. Together, there are 52 genera shown under both conditions, and the data is considered to represent low complex metagenomic data. Figure 16.7 demonstrates that RLE and RFY should not be recommended for normalization of the metagenomic data.

Fig. 16.7
figure 7

Boxplots of coefficients of variation of counts in the raw data, and the normalized data for the mouse stool metagenomic data

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Check for updates. Verify currency and authenticity via CrossMark

Cite this chapter

Du, R., An, L., Fang, Z. (2018). Performance Evaluation of Normalization Approaches for Metagenomic Compositional Data on Differential Abundance Analysis. In: Zhao, Y., Chen, DG. (eds) New Frontiers of Biostatistics and Bioinformatics. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-99389-8_16

Download citation

Publish with us

Policies and ethics