Metagenomics: Facts and Artifacts, and Computational Challenges

  • Survey
  • Journal of Computer Science and Technology

Abstract

Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. By enabling the analysis of populations that include many (so far) unculturable and often unknown microbes, metagenomics is revolutionizing the field of microbiology, and has excited researchers in many disciplines that could benefit from the study of environmental microbes, including ecology, environmental sciences, and biomedicine. Specific computational and statistical tools have been developed for metagenomic data analysis and comparison. New studies, however, have revealed various kinds of artifacts in metagenomic data, caused by limitations in the experimental protocols and/or inadequate data analysis procedures, which often lead to incorrect conclusions about a microbial community. Here, we review some of these artifacts, such as overestimation of species diversity and incorrect estimation of gene family frequencies, and discuss emerging computational approaches to address them. We also review potential challenges that metagenomics may encounter with the extensive application of next-generation sequencing (NGS) techniques.

Author information

Corresponding author

Correspondence to John C. Wooley.

Additional information

This work is supported by NIH under Grant No. 1R01HG004908-01, NSF of USA under Grant No. DBI-0845685 (YY), and also the Gordon and Betty Moore Foundation for the Community Cyberinfrastructure for Marine Microbial Ecological Research and Analysis (CAMERA) Project (JW).

About this article

Cite this article

Wooley, J.C., Ye, Y. Metagenomics: Facts and Artifacts, and Computational Challenges. J. Comput. Sci. Technol. 25, 71–81 (2010). https://doi.org/10.1007/s11390-010-9306-4

  • DOI: https://doi.org/10.1007/s11390-010-9306-4
