Molecular Biotechnology, Volume 10, Issue 1, pp 27–48

Computational methods for exon detection

  • Jean-Michel Claverie


Computer methods for the complete and accurate detection of genes in vertebrate genomic sequences are still a long way from perfection. The intermediate task of identifying the coding moiety of genes (coding exons) is now reasonably well achieved using a combination of methods. After reviewing the intrinsic difficulties in interpreting vertebrate genomic sequences, this article presents the state of the art, with an emphasis on similarity search methods and the resources available through the Internet.

Index Entries

Bioinformatics; vertebrate gene finding; vertebrate genome annotation; similarity search; Internet




Copyright information

© Humana Press Inc 1998

Authors and Affiliations

  1. Marseille, France
