Abstract
Gene structure prediction is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics and artificial intelligence; surprisingly, applications of theoretical computer science methods to gene recognition have remained almost unexplored. Recent advances in large-scale cDNA sequencing open the way to a new combinatorial approach to gene recognition. This paper describes a spliced alignment algorithm and a software tool that explore all possible exon assemblies in polynomial time and find the multiexon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives, the average correlation between the predicted and the actual genes was 99%, a very high accuracy compared with other existing methods. The algorithm correctly reconstructed 87% of genes, and the rare discrepancies between the predicted and real exon-intron structures were caused by (i) extremely short (less than 5 amino acids) initial or terminal exons, (ii) alternative splicing, or (iii) errors in database feature tables. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is non-vertebrate or even prokaryotic. The surprisingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins showing just 25% similarity, the correlation between the predicted and actual genes was still as high as 95%.
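The idea of exploring all exon assemblies in polynomial time can be illustrated with a small dynamic-programming sketch. It is a simplified illustration, not the authors' implementation: it assumes that candidate exons are given as half-open, non-empty intervals of the genomic string, that the target is a plain string over the same alphabet (rather than a protein), and that scoring uses unit match, mismatch, and gap values. The function name `spliced_alignment` and all parameters are hypothetical.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open interval [start, end) in the genomic sequence


def spliced_alignment(genomic: str, blocks: List[Block], target: str,
                      match: int = 1, mismatch: int = -1,
                      gap: int = -1) -> Tuple[int, List[Block]]:
    """Return (best_score, chain): the chain of non-overlapping blocks whose
    concatenation has the best global alignment score against target.
    Assumption: every block is non-empty (start < end)."""
    n = len(target)
    if not blocks:
        return n * gap, []
    # Process blocks by end position so that all possible predecessors
    # of a block are finished before the block itself is entered.
    order = sorted(range(len(blocks)), key=lambda i: blocks[i][1])
    # Row for an empty chain: j target characters aligned against gaps.
    empty_row = [(j * gap, ()) for j in range(n + 1)]
    final_rows = {}  # block index -> last DP row after consuming that block
    best = None
    for i in order:
        start, end = blocks[i]
        # Entry row: start a new chain here, or continue from any block
        # that ends at or before this block's start.
        preds = [final_rows[p] for p in final_rows if blocks[p][1] <= start]
        row = []
        for j in range(n + 1):
            cands = [empty_row[j]] + [r[j] for r in preds]
            score, chain = max(cands, key=lambda c: c[0])
            row.append((score, chain + (i,)))
        # Standard global alignment, one genomic character at a time.
        for c in genomic[start:end]:
            new = [(row[0][0] + gap, row[0][1])]
            for j in range(1, n + 1):
                sub = match if c == target[j - 1] else mismatch
                cands = [
                    (row[j - 1][0] + sub, row[j - 1][1]),  # align c with target[j-1]
                    (row[j][0] + gap, row[j][1]),          # c against a gap
                    (new[j - 1][0] + gap, new[j - 1][1]),  # target[j-1] against a gap
                ]
                new.append(max(cands, key=lambda x: x[0]))
            row = new
        final_rows[i] = row
        if best is None or row[n][0] > best[0]:
            best = row[n]
    return best[0], [blocks[k] for k in best[1]]


# Toy example: two true exons, an intron between them, and two decoy blocks.
genomic = "ATGCC" + "GTAAG" + "GATTA" + "CAGTT"
candidates = [(0, 5), (3, 9), (10, 15), (12, 20)]
score, chain = spliced_alignment(genomic, candidates, "ATGCCGATTA")
print(score, chain)
```

For brevity the sketch stores the whole chain in every cell; a practical implementation would keep backpointers instead. The running time is polynomial in the total block length, the number of blocks, and the target length, because each block is aligned once and its entry row is computed from the final rows of earlier blocks rather than by enumerating chains.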
The research was supported by DOE grant DE-FG02-95ER61919, Russian Fund of Fundamental Research grant 94-04-12330, grant MTW300 from ISF, and the Russian State Program “Human Genome”.
The research was supported by DOE grant DE-FG02-95ER61919 and the Russian State Program “Human Genome”.
The research was supported by DOE grant DE-FG02-95ER61919 and by NSF Young Investigator award CCR-9457784.
Copyright information
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gelfand, M.S., Mironov, A.A., Pevzner, P.A. (1996). Spliced alignment: A new approach to gene recognition. In: Hirschberg, D., Myers, G. (eds) Combinatorial Pattern Matching. CPM 1996. Lecture Notes in Computer Science, vol 1075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61258-0_12
DOI: https://doi.org/10.1007/3-540-61258-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61258-2
Online ISBN: 978-3-540-68390-2
eBook Packages: Springer Book Archive