Using Protein Domains to Improve the Accuracy of Ab Initio Gene Finding

  • Conference paper
Algorithms in Bioinformatics (WABI 2007)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4645)

Abstract

Background: Protein domains are the common functional elements used by nature to generate tremendous diversity among proteins, and they are used repeatedly in different combinations across all major domains of life. In this paper we address the problem of using similarity to known protein domains to help identify genes in a DNA sequence. We have adapted the generalized hidden Markov model (GHMM) architecture of the ab initio gene finder GlimmerHMM so that a higher probability is assigned to exons that contain homologues of protein domains. To our knowledge, this domain-homology-based approach has not been used previously in the context of ab initio gene prediction.

Results: GlimmerHMM was augmented with a protein domain module that recognizes gene structures that are similar to Pfam models. The augmented system, GlimmerHMM+, shows a 2% improvement in sensitivity and a 1% increase in specificity in predicting exact gene structures compared to GlimmerHMM without this option. These results were obtained on two very different model organisms, Arabidopsis thaliana (mustard weed) and Danio rerio (zebrafish), and together these preliminary results demonstrate the value of using protein domain homology in gene prediction. We believe that a more comprehensive approach, including a model that reflects the statistical characteristics of specific sets of protein domain families, would yield a greater increase in the accuracy of gene prediction. GlimmerHMM and GlimmerHMM+ are freely available as open source software at http://cbcb.umd.edu/software.
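The abstract reports sensitivity and specificity for exact gene structures but does not spell out how they are computed. The sketch below assumes the convention common in gene-finding evaluations: sensitivity is the fraction of annotated genes predicted exactly, and specificity is the fraction of predicted genes that exactly match an annotation. The function name `exact_gene_accuracy` and the tuple representation of a gene are hypothetical, chosen only for illustration; this is not code from GlimmerHMM.

```python
from typing import Iterable, Tuple

# Hypothetical representation (an assumption, not from the paper):
# a gene is (sequence_id, strand, ((exon_start, exon_end), ...)).
Gene = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def exact_gene_accuracy(
    predicted: Iterable[Gene], annotated: Iterable[Gene]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exact-gene level.

    A prediction counts as correct only if every exon boundary matches
    an annotated gene on the same sequence and strand.
    """
    predicted_set = set(predicted)
    annotated_set = set(annotated)
    exact_matches = len(predicted_set & annotated_set)

    # Sensitivity: share of annotated genes recovered exactly.
    sensitivity = exact_matches / len(annotated_set) if annotated_set else 0.0
    # Specificity (as assumed here): share of predictions that are exact.
    specificity = exact_matches / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity


# Toy data, invented for illustration only.
annotated_genes = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "+", ((1000, 1200),)),
    ("chr1", "-", ((2000, 2100), (2300, 2500))),
    ("chr2", "+", ((50, 150),)),
]
predicted_genes = [
    ("chr1", "+", ((100, 200), (300, 400))),      # exact match
    ("chr1", "+", ((1000, 1250),)),               # wrong exon end
    ("chr1", "-", ((2000, 2100), (2300, 2500))),  # exact match
]

sn, sp = exact_gene_accuracy(predicted_genes, annotated_genes)
print(f"sensitivity={sn:.3f} specificity={sp:.3f}")
```

Under this definition a single misplaced exon boundary makes the whole gene count as wrong, which is why gains of a few percentage points at the exact-gene level can be meaningful.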

Editor information

Raffaele Giancarlo, Sridhar Hannenhalli

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pertea, M., Salzberg, S.L. (2007). Using Protein Domains to Improve the Accuracy of Ab Initio Gene Finding. In: Giancarlo, R., Hannenhalli, S. (eds) Algorithms in Bioinformatics. WABI 2007. Lecture Notes in Computer Science, vol 4645. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74126-8_20

  • DOI: https://doi.org/10.1007/978-3-540-74126-8_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74125-1

  • Online ISBN: 978-3-540-74126-8

  • eBook Packages: Computer Science, Computer Science (R0)