Abstract
Other chapters of this volume have presented the various experimental methods (mainly exon trapping, recombination-based, and hybridization-based approaches) used for the identification of transcribed sequences within cloned genomic fragments. None of those methods requires detailed sequence information on the genomic region of interest. However, since generating large genomic sequences is becoming more routine, identifying transcribed regions by computer analysis of large genomic sequences (i.e., “software trapping”) is also becoming a viable alternative. After an overview of the various computational methods at hand, this chapter focuses on the use of database similarity searches for the identification of exons in mammalian genomes.
Copyright information
© 1997 Humana Press Inc., Totowa, NJ
About this protocol
Cite this protocol
Claverie, JM. (1997). Exon Detection by Similarity Searches. In: Boultwood, J. (eds) Gene Isolation and Mapping Protocols. Methods in Molecular Biology™, vol 68. Humana Press. https://doi.org/10.1385/0-89603-482-8:283
DOI: https://doi.org/10.1385/0-89603-482-8:283
Publisher Name: Humana Press
Print ISBN: 978-0-89603-482-2
Online ISBN: 978-1-59259-554-9
eBook Packages: Springer Protocols