Ab Initio Gene Identification in Metagenomic Sequences

Tang, Shiyuyun; Borodovsky, Mark

doi:10.1007/978-1-4614-6418-1_440-1

Synonyms

Statistical or intrinsic methods of gene prediction

Definition

Computational inference of how a metagenomic sequence is divided into protein-coding and noncoding regions, based on the presence or absence of characteristic oligonucleotide frequency patterns.
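
As a rough illustration of what "oligonucleotide frequency patterns" can mean in practice, the sketch below counts trinucleotide frequencies in two toy training sequences and scores a window by a log-odds ratio. It is a minimal sketch under stated assumptions, not the algorithm of any gene finder listed in the references: the function names, the toy sequences, and the additive smoothing are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import product
from math import log

ALPHABET = "ACGT"

def kmer_frequencies(seq, k=3, step=1, pseudocount=1.0):
    """Smoothed relative frequencies of all 4**k oligonucleotides in seq.

    step=3 reads non-overlapping in-frame codons; step=1 reads every
    overlapping k-mer. The pseudocount (an assumption of this sketch)
    keeps unseen k-mers from getting zero probability.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}

def log_odds(window, coding_freq, noncoding_freq, k=3):
    """Sum of log(coding / noncoding) over the in-frame k-mers of window.

    Positive scores mean the window looks more like the coding sample.
    K-mers containing symbols outside ACGT are skipped.
    """
    window = window.upper()
    score = 0.0
    for i in range(0, len(window) - k + 1, k):
        kmer = window[i:i + k]
        if kmer in coding_freq:
            score += log(coding_freq[kmer] / noncoding_freq[kmer])
    return score

# Toy training data, invented for illustration only.
coding_train = "ATGGCTGCTGCTGCTAAAGCTGCTTAA"    # read codon by codon
noncoding_train = "TTTTATATATTTTAATATATTTATAT"  # read with overlap

coding_freq = kmer_frequencies(coding_train, step=3)
noncoding_freq = kmer_frequencies(noncoding_train, step=1)

gc_rich_score = log_odds("GCTGCTGCT", coding_freq, noncoding_freq)
at_rich_score = log_odds("TATATATAT", coding_freq, noncoding_freq)
print(gc_rich_score > 0, at_rich_score < 0)
```

Real ab initio gene finders use far richer models (for example, higher-order Markov chains evaluated in all six reading frames), so the sketch only shows the general idea of comparing frequency patterns between two classes of sequence.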

Introduction

As of April 2013, sequences of 370 metagenomes were available in databases. For comparison, the Genomes Online Database (www.genomesonline.org) lists 186 complete archaeal and 3,956 complete bacterial genomes, along with about 15,000 incomplete (draft) prokaryotic genomes. Because an average metagenome is about 100 times larger than an average prokaryotic genome, the current volume of metagenomic sequence is roughly twice the total volume of “genomic” sequence. Current metagenomes therefore carry more genes than all sequenced prokaryotic genomes combined, and this gap is growing.
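
The "twice as large" estimate can be checked with a short calculation using only the counts quoted in the paragraph above. The sketch assumes that draft genomes have the same average size as complete ones and takes "about 15,000" at face value; the variable names are introduced here for illustration.

```python
# Counts quoted in the text (April 2013).
metagenomes = 370
complete_archaeal = 186
complete_bacterial = 3956
draft_prokaryotic = 15000  # "about 15,000" in the text, taken at face value

# Average metagenome size, in units of one average prokaryotic genome.
size_ratio = 100

# Assumption: a draft genome counts as one average genome, like a complete one.
genomes_total = complete_archaeal + complete_bacterial + draft_prokaryotic

# Metagenomic sequence volume expressed in "average genome" units.
metagenomic_volume = metagenomes * size_ratio

ratio = metagenomic_volume / genomes_total
print(f"{metagenomic_volume} / {genomes_total} = {ratio:.2f}")
```

Under these assumptions the ratio comes out just under 2, which is consistent with the statement in the text.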

Notably, gene prediction and annotation of gene and protein function is more challenging in metagenomes than in draft or...

References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
Antonov I, Borodovsky M. GeneTack: frameshift identification in protein-coding sequences by the Viterbi algorithm. J Bioinforma Comput Biol. 2010;8(3):535–51. PubMed PMID: 20556861.
Badger JH, Olsen GJ. CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol. 1999;16(4):512–24.
Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994;2:28–36. PubMed PMID: 7584402.
Besemer J, Borodovsky M. Heuristic approach to deriving models for gene finding. Nucleic Acids Res. 1999;27(19):3911–20. PubMed PMID: 10481031. Pubmed Central PMCID: 148655.
Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 2001;29(12):2607–18. PubMed PMID: 11410670. Pubmed Central PMCID: 55746.
Borodovsky M, McIninch J. GENMARK: parallel gene recognition for both DNA strands. Comp Chem. 1993;17(2):123–33.
Borodovsky MY, Sprizhitskii Y, Golovanov E, Aleksandrov A. Statistical patterns in primary structures of functional regions in the E. coli genome. III. Computer recognition of coding regions. Mol Biol. 1986;20:1145–50.
Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009;6(9):673–6. PubMed PMID: 19648916. Pubmed Central PMCID: 2762791.
Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, et al. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science. 1996;273(5278):1058–73. PubMed PMID: 8688087.
Chen SL, Lee W, Hottes AK, Shapiro L, McAdams HH. Codon usage between genomes is constrained by genome-wide mutational processes. Proc Natl Acad Sci U S A. 2004;101(10):3480–5. PubMed PMID: 14990797. Pubmed Central PMCID: 373487.
Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23(6):673–9. PubMed PMID: 17237039. Pubmed Central PMCID: 2387122.
Frishman D, Mironov A, Mewes H-W, Gelfand M. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. 1998;26(12):2941–7.
Gish W, States DJ. Identification of protein coding regions by database similarity search. Nat Genet. 1993;3(3):266–72.
Hoff KJ. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics. 2009;10:520. PubMed PMID: 19909532. Pubmed Central PMCID: 2781827.
Hoff KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P. Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinforma. 2008;9:217. PubMed PMID: 18442389. Pubmed Central PMCID: 2409338.
Hoff KJ, Lingner T, Meinicke P, Tech M. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 2009;37(Web Server issue):W101–5. PubMed PMID: 19429689. Pubmed Central PMCID: 2703946.
Kelley DR, Liu B, Delcher AL, Pop M, Salzberg SL. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res. 2012;40(1):e9. PubMed PMID: 22102569. Pubmed Central PMCID: 3245904.
Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev. 2008;72(4):557–78. PubMed PMID: 19052320. Pubmed Central PMCID: 2593568.
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262(5131):208–14. PubMed PMID: 8211139.
Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLoS ONE. 2012;7(2):e30087.
Noguchi H, Park J, Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006;34(19):5623–30. PubMed PMID: 17028096. Pubmed Central PMCID: 1636498.
Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 2008;15(6):387–96. PubMed PMID: 18940874. Pubmed Central PMCID: 2608843.
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191. PubMed PMID: 20805240. Pubmed Central PMCID: 2978382.
Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 1998;26(2):544–8. PubMed PMID: 9421513. Pubmed Central PMCID: 147303.
Tang S, Antonov I, Borodovsky M. MetaGeneTack: ab initio detection of frameshifts in metagenomic sequences. Bioinformatics. 2013;29(1):114–6. PubMed PMID: 23129300. Pubmed Central PMCID: 3530910.
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6(2):e1000667. PubMed PMID: 20195499. Pubmed Central PMCID: 2829047.
Yok NG, Rosen GL. Combining gene prediction methods to improve metagenomic gene annotation. BMC Bioinforma. 2011;12:20. PubMed PMID: 21232129. Pubmed Central PMCID: 3042383.
Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 2010;38(12):e132. PubMed PMID: 20403810. Pubmed Central PMCID: 2896542.

Author information

Authors and Affiliations

School of Biology, Georgia Institute of Technology, Atlanta, GA, 30332, USA
Shiyuyun Tang
Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Center for Bioinformatics and Computational Genomics, Atlanta, GA, 30332, USA
Mark Borodovsky

Corresponding author

Correspondence to Mark Borodovsky.

Editor information

Editors and Affiliations

J. Craig Venter Institute (JCVI), Rockville, Maryland, USA
Karen E. Nelson

About this entry

Cite this entry

Tang, S., Borodovsky, M. (2013). Ab Initio Gene Identification in Metagenomic Sequences. In: Nelson, K. (ed.) Encyclopedia of Metagenomics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6418-1_440-1

DOI: https://doi.org/10.1007/978-1-4614-6418-1_440-1
Received: 28 July 2013
Accepted: 28 July 2013
Published: 15 April 2014
Publisher Name: Springer, New York, NY
Online ISBN: 978-1-4614-6418-1
eBook Packages: Springer Reference Biomedicine and Life Sciences, Reference Module Biomedical and Life Sciences
