Advancing the State of the Art in Computational Gene Prediction

  • Conference paper
Knowledge Discovery and Emergent Complexity in Bioinformatics (KDECB 2006)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4366)

Abstract

Current methods for computationally predicting the locations and intron-exon structures of protein-coding genes in eukaryotic DNA are largely based on probabilistic, state-based generative models such as hidden Markov models and their various extensions. Unfortunately, little attention has been paid to the optimality of these models for the gene-parsing problem. Furthermore, as the prevalence of alternative splicing in human genes becomes more apparent, the “one gene, one parse” discipline endorsed by virtually all current gene-finding systems becomes less attractive from a biomedical perspective. Because our ability to accurately identify all the isoforms of each gene in the genome is of direct importance to biomedicine, our ability to improve gene-finding accuracy both for human and non-human DNA clearly has a potential to significantly impact human health. In this paper we review current methods and suggest a number of possible directions for further research that may alleviate some of these problems and ultimately lead to better and more useful gene predictions.

Editor information

Karl Tuyls, Ronald Westra, Yvan Saeys, Ann Nowé

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Majoros, W.H., Ohler, U. (2007). Advancing the State of the Art in Computational Gene Prediction. In: Tuyls, K., Westra, R., Saeys, Y., Nowé, A. (eds) Knowledge Discovery and Emergent Complexity in Bioinformatics. KDECB 2006. Lecture Notes in Computer Science, vol 4366. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71037-0_6

  • DOI: https://doi.org/10.1007/978-3-540-71037-0_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71036-3

  • Online ISBN: 978-3-540-71037-0

  • eBook Packages: Computer Science, Computer Science (R0)
