Abstract
Current methods for computationally predicting the locations and intron-exon structures of protein-coding genes in eukaryotic DNA are largely based on probabilistic, state-based generative models such as hidden Markov models and their various extensions. Unfortunately, little attention has been paid to the optimality of these models for the gene-parsing problem. Furthermore, as the prevalence of alternative splicing in human genes becomes more apparent, the “one gene, one parse” discipline endorsed by virtually all current gene-finding systems becomes less attractive from a biomedical perspective. Because accurately identifying all the isoforms of each gene in the genome is of direct importance to biomedicine, improving gene-finding accuracy for both human and non-human DNA has clear potential to significantly impact human health. In this paper we review current methods and suggest a number of possible directions for further research that may alleviate some of these problems and ultimately lead to better and more useful gene predictions.
© 2007 Springer Berlin Heidelberg
Cite this paper
Majoros, W.H., Ohler, U. (2007). Advancing the State of the Art in Computational Gene Prediction. In: Tuyls, K., Westra, R., Saeys, Y., Nowé, A. (eds) Knowledge Discovery and Emergent Complexity in Bioinformatics. KDECB 2006. Lecture Notes in Computer Science, vol 4366. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71037-0_6
DOI: https://doi.org/10.1007/978-3-540-71037-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71036-3
Online ISBN: 978-3-540-71037-0
eBook Packages: Computer Science (R0)