Gene Structure Submodels

Part of the book series: Computational Biology (COBO, volume 20)

Abstract

A gene-finding algorithm integrates a wide range of scores, or signals, produced by the constituent states of its model. These states are themselves complex submodels, each incorporating a number of sensors that score different characteristics of the submodel. Such sensors are traditionally divided into two groups: signal sensors and content sensors. Signal sensors model the transitions between states and attempt to detect the boundaries between exons and introns in the sequence, while content sensors score the content of a candidate region, such as the base composition or length distribution of a candidate exon or intron. In this chapter we describe some of the main submodels used in gene-finding algorithms, and detail a number of different methods for integrating the sensors that the submodels incorporate.
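The distinction between the two sensor types can be illustrated with a minimal Python sketch. It is not taken from the chapter: the position weight matrix used as the signal sensor, the single-base composition model used as the content sensor, the additive log-odds combination, and all function names and toy training data are assumptions chosen for illustration. Length distributions, which the abstract also lists as a content feature, are left out.

```python
import math

BASES = "ACGT"

def train_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned, equal-length sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def signal_score(pwm, window, background):
    """Signal sensor: log-odds of a boundary window under the PWM vs. background."""
    if len(window) != len(pwm):
        raise ValueError("window length must match the PWM length")
    return sum(math.log(pwm[i][b] / background[b]) for i, b in enumerate(window))

def content_score(region, region_freq, background):
    """Content sensor: log-odds of a region's base composition vs. background."""
    return sum(math.log(region_freq[b] / background[b]) for b in region)

def candidate_exon_score(seq, start, end, pwm, exon_freq, background):
    """Add the content score of seq[start:end] to the signal score at its 3' boundary."""
    window = seq[end:end + len(pwm)]
    return (content_score(seq[start:end], exon_freq, background)
            + signal_score(pwm, window, background))

# Hypothetical toy data: three donor-like sites and made-up base frequencies.
background = {b: 0.25 for b in BASES}
exon_freq = {"A": 0.22, "C": 0.28, "G": 0.28, "T": 0.22}
donor_pwm = train_pwm(["GTAAGT", "GTGAGT", "GTAAGA"])

sequence = "ATGGCCGCGGTAAGTCCTT"
score = candidate_exon_score(sequence, 0, 9, donor_pwm, exon_freq, background)
print(f"candidate exon score: {score:.3f}")
```

In this sketch the two sensors are combined by summing log-odds, which is only one of several possible integration schemes; a full gene finder would also weigh length and frame information.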

Author information

Correspondence to Marina Axelson-Fisk.

Copyright information

© 2015 Springer-Verlag London

About this chapter

Cite this chapter

Axelson-Fisk, M. (2015). Gene Structure Submodels. In: Comparative Gene Finding. Computational Biology, vol 20. Springer, London. https://doi.org/10.1007/978-1-4471-6693-1_5

  • DOI: https://doi.org/10.1007/978-1-4471-6693-1_5

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-6692-4

  • Online ISBN: 978-1-4471-6693-1

  • eBook Packages: Computer Science, Computer Science (R0)
