
Ab Initio Gene Identification in Metagenomic Sequences

  • Living reference work entry
Encyclopedia of Metagenomics

Synonyms

Statistical or intrinsic methods of gene prediction

Definition

Computational inference of how a metagenomic sequence is divided into protein-coding and noncoding regions based on presence or absence of characteristic oligonucleotide frequency patterns.
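As a rough illustration of what "characteristic oligonucleotide frequency patterns" can mean in practice, the sketch below scores a sequence window by the log-odds of its in-frame trinucleotides under a coding model versus a noncoding model. It is a toy example, not the method of any tool cited in this entry: the training sequences, the uniform noncoding model, the pseudocount, and the function names `codon_log_probs` and `log_odds` are all assumptions made for the illustration.

```python
import math
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"
CODONS = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]


def codon_log_probs(seqs, pseudocount=1.0):
    """Estimate in-frame trinucleotide log-frequencies from example sequences.

    A pseudocount is added to every trinucleotide so unseen ones keep a
    small nonzero probability.
    """
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 2, 3):
            codon = s[i:i + 3]
            if set(codon) <= set(NUCLEOTIDES):
                counts[codon] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: math.log((counts[c] + pseudocount) / total) for c in CODONS}


def log_odds(seq, coding_lp, noncoding_lp):
    """Sum per-trinucleotide log-odds; a positive total favors 'coding'."""
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_lp:
            score += coding_lp[codon] - noncoding_lp[codon]
    return score


# Hypothetical training data: two short made-up "coding" sequences.
toy_coding = ["ATGGCTGCTAAAGCTGCT", "ATGAAAGCTGCTGCTAAA"]
coding_lp = codon_log_probs(toy_coding)

# Assumed noncoding model: every trinucleotide equally likely.
noncoding_lp = {c: math.log(1.0 / len(CODONS)) for c in CODONS}

query = "GCTGCTGCTGCT"
in_frame = log_odds(query, coding_lp, noncoding_lp)
shifted = log_odds(query[1:], coding_lp, noncoding_lp)
print(f"in-frame score: {in_frame:.2f}, shifted-frame score: {shifted:.2f}")
```

In this toy setting the in-frame reading scores positive and the shifted reading scores negative, which is the kind of frame-dependent signal that statistical gene finders exploit. Real methods use much larger training sets and higher-order models.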

Introduction

As of April 2013, sequences of 370 metagenomes were available in databases. By comparison, the Genomes Online Database (www.genomesonline.org) listed 186 complete archaeal and 3,956 complete bacterial genomes, along with about 15,000 incomplete (draft) prokaryotic genomes. Because an average metagenome is about 100 times the size of an average prokaryotic genome, the current volume of metagenomic sequence is roughly twice the total volume of “genomic” sequence. Metagenomes therefore already carry more genes than all sequenced prokaryotic genomes combined, and this gap is growing.
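The "twice as large" estimate can be checked with a back-of-envelope calculation in units of one average prokaryotic genome. The sketch below uses only the counts quoted above. Treating each draft genome as a full-size genome is an assumption made here to keep the arithmetic simple.

```python
# Volumes are expressed in "average prokaryotic genome" equivalents.
metagenomes = 370
size_ratio = 100             # average metagenome size / average prokaryotic genome size
complete_archaeal = 186
complete_bacterial = 3956
draft_prokaryotic = 15000    # "about 15,000"; assumed here to count as full-size genomes

metagenomic_volume = metagenomes * size_ratio
genomic_volume = complete_archaeal + complete_bacterial + draft_prokaryotic
ratio = metagenomic_volume / genomic_volume

print(f"metagenomic: {metagenomic_volume}, genomic: {genomic_volume}, ratio: {ratio:.2f}")
```

Under these assumptions the ratio comes out slightly below 2, consistent with the statement that metagenomic sequence volume is about twice the genomic total.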

Notably, gene prediction and annotation of gene and protein function is more challenging in metagenomes than in draft or...

Corresponding author

Correspondence to Mark Borodovsky.

Copyright information

© 2013 Springer Science+Business Media New York

About this entry

Cite this entry

Tang, S., Borodovsky, M. (2013). Ab Initio Gene Identification in Metagenomic Sequences. In: Nelson, K. (eds) Encyclopedia of Metagenomics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6418-1_440-1

  • DOI: https://doi.org/10.1007/978-1-4614-6418-1_440-1

  • Publisher Name: Springer, New York, NY

  • Online ISBN: 978-1-4614-6418-1

  • eBook Packages: Springer Reference Biomedicine and Life Sciences, Reference Module Biomedical and Life Sciences
