A Statistical Framework for the Functional Analysis of Metagenomes

  • Itai Sharon
  • Amrita Pati
  • Victor M. Markowitz
  • Ron Y. Pinter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5541)


Metagenomic studies consider the genetic makeup of microbial communities as a whole, rather than that of their individual member organisms. The functional and metabolic potential of microbial communities can be analyzed by comparing the relative abundance of gene families in their collective genomic sequences (metagenome) under different conditions. Such comparisons require accurate estimation of gene family frequencies. We present a statistical framework for assessing these frequencies based on the Lander-Waterman theory, originally developed for Whole Genome Shotgun (WGS) sequencing projects. We also provide a novel method for assessing the reliability of the estimates, which can be used to remove seemingly unreliable measurements. We tested our method on a wide range of datasets, including simulated genomes and real WGS data from whole-genome sequencing projects. The results suggest that our framework corrects inherent biases in accepted methods and provides a good approximation to the true statistics of gene families in WGS projects.
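
The abstract does not spell out the estimator itself, so the following is only a minimal sketch of the classical Lander-Waterman quantities that such a framework would plausibly build on. It is not the authors' method. The function names, the `min_overlap` detection threshold, and the assumption that read start positions are uniform over a single genome (with edge effects ignored) are all illustrative assumptions.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Lander-Waterman coverage c = N * L / G."""
    return num_reads * read_len / genome_len


def fraction_uncovered(c: float) -> float:
    """Expected fraction of bases hit by no read, e^(-c) (Poisson approximation)."""
    return math.exp(-c)


def expected_reads_on_gene(num_reads: int, read_len: int, genome_len: int,
                           gene_len: int, min_overlap: int) -> float:
    """Expected number of reads overlapping a gene by at least min_overlap bases.

    Hypothetical helper: a read qualifies if it starts inside a window of
    gene_len + read_len - 2 * min_overlap + 1 positions around the gene.
    Read starts are assumed uniform over the genome; edge effects are ignored.
    """
    if not 0 < min_overlap <= min(gene_len, read_len):
        raise ValueError("min_overlap must be in (0, min(gene_len, read_len)]")
    window = gene_len + read_len - 2 * min_overlap + 1
    return num_reads * window / genome_len


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    n_reads, read_len, genome_len = 10_000, 800, 2_000_000
    c = coverage(n_reads, read_len, genome_len)
    print(f"coverage: {c:.2f}, uncovered fraction: {fraction_uncovered(c):.4f}")
    for gene_len in (300, 1000):
        hits = expected_reads_on_gene(n_reads, read_len, genome_len, gene_len, 50)
        print(f"gene of {gene_len} bp: about {hits:.3f} reads expected")
```

Under these assumptions a longer gene attracts more reads than a shorter one at the same copy number, which shows why raw read counts can give a biased picture of gene family frequencies.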


Keywords: metagenomics, functional analysis, function comparison, Lander-Waterman





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Itai Sharon (1)
  • Amrita Pati (2)
  • Victor M. Markowitz (3)
  • Ron Y. Pinter (1)
  1. Department of Computer Science, Technion, Haifa, Israel
  2. Genome Biology Program, DOE Joint Genome Institute, Walnut Creek, USA
  3. Biological Data Management and Technology Center, Lawrence Berkeley National Laboratory, Berkeley, USA
