Advancing the State of the Art in Computational Gene Prediction

Majoros, William H.; Ohler, Uwe

doi:10.1007/978-3-540-71037-0_6

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4366)

Included in the following conference series:

International Workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics

Abstract

Current methods for computationally predicting the locations and intron-exon structures of protein-coding genes in eukaryotic DNA are largely based on probabilistic, state-based generative models such as hidden Markov models and their various extensions. Unfortunately, little attention has been paid to the optimality of these models for the gene-parsing problem. Furthermore, as the prevalence of alternative splicing in human genes becomes more apparent, the “one gene, one parse” discipline endorsed by virtually all current gene-finding systems becomes less attractive from a biomedical perspective. Because our ability to accurately identify all the isoforms of each gene in the genome is of direct importance to biomedicine, our ability to improve gene-finding accuracy both for human and non-human DNA clearly has a potential to significantly impact human health. In this paper we review current methods and suggest a number of possible directions for further research that may alleviate some of these problems and ultimately lead to better and more useful gene predictions.

Author information

Authors and Affiliations

Center for Bioinformatics and Computational Biology, Institute for Genome Sciences and Policy, Duke University, 101 Science Drive, Durham, NC 27708, USA
William H. Majoros & Uwe Ohler

Editor information

Karl Tuyls, Ronald Westra, Yvan Saeys, Ann Nowé

Cite this paper

Majoros, W.H., Ohler, U. (2007). Advancing the State of the Art in Computational Gene Prediction. In: Tuyls, K., Westra, R., Saeys, Y., Nowé, A. (eds) Knowledge Discovery and Emergent Complexity in Bioinformatics. KDECB 2006. Lecture Notes in Computer Science, vol 4366. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71037-0_6

DOI: https://doi.org/10.1007/978-3-540-71037-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71036-3
Online ISBN: 978-3-540-71037-0
eBook Packages: Computer Science, Computer Science (R0)
