Abstract
Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. By enabling the analysis of populations that include many so-far unculturable and often unknown microbes, metagenomics is revolutionizing the field of microbiology and has excited researchers in the many disciplines that could benefit from the study of environmental microbes, including ecology, environmental sciences, and biomedicine. Dedicated computational and statistical tools have been developed for metagenomic data analysis and comparison. Recent studies, however, have revealed various kinds of artifacts in metagenomic data, caused by limitations in the experimental protocols and/or inadequate data analysis procedures, which often lead to incorrect conclusions about a microbial community. Here, we review some of these artifacts, such as overestimation of species diversity and incorrect estimation of gene family frequencies, and discuss emerging computational approaches to address them. We also review potential challenges that metagenomics may encounter with the extensive application of next-generation sequencing (NGS) techniques.
Additional information
This work is supported by NIH under Grant No. 1R01HG004908-01, by NSF of USA under Grant No. DBI-0845685 (YY), and by the Gordon and Betty Moore Foundation for the Community Cyberinfrastructure for Marine Microbial Ecological Research and Analysis (CAMERA) Project (JW).
Cite this article
Wooley, J.C., Ye, Y. Metagenomics: Facts and Artifacts, and Computational Challenges. J. Comput. Sci. Technol. 25, 71–81 (2010). https://doi.org/10.1007/s11390-010-9306-4