Abstract
A gene model algorithm integrates a wide range of scores, or signals, coming from the constituent states of the model. These states are themselves complex submodels, each incorporating a number of sensors that score different characteristics of the submodel. Such sensors are traditionally divided into two groups: content sensors and signal sensors. Signal sensors model the transitions between states and attempt to detect the boundaries between exons and introns in the sequence, while content sensors score the content of a candidate region, such as the base composition or length distribution of a candidate exon or intron. In this chapter we describe some of the main submodels used in gene finding algorithms, and detail a number of different methods for integrating the sensors that the submodels incorporate.
© 2015 Springer-Verlag London
About this chapter
Cite this chapter
Axelson-Fisk, M. (2015). Gene Structure Submodels. In: Comparative Gene Finding. Computational Biology, vol 20. Springer, London. https://doi.org/10.1007/978-1-4471-6693-1_5
DOI: https://doi.org/10.1007/978-1-4471-6693-1_5
Published:
Publisher Name: Springer, London
Print ISBN: 978-1-4471-6692-4
Online ISBN: 978-1-4471-6693-1
eBook Packages: Computer Science, Computer Science (R0)