Abstract
In recent years, the completion of human genome sequencing has raised a wide range of challenging new problems in raw data analysis. In particular, discovering the information implicitly encoded in biological sequences is assuming a prominent role in identifying genetic diseases and in deciphering biological mechanisms. This information is usually represented by patterns that occur frequently in the sequences. Biological observations have made one specific class of patterns particularly interesting: frequent structured patterns. In this respect, it is biologically meaningful to look at both “exact” and “approximate” repetitions of the patterns within the available sequences. This paper contributes to this setting by providing algorithms that discover frequent structured patterns, in either “exact” or “approximate” form, in a collection of input biological sequences.
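To make the notions of “exact” and “approximate” frequent structured patterns concrete, the following is a minimal brute-force sketch, not the algorithms of the paper. It rests on assumptions that the abstract does not state: a structured pattern is taken to be two boxes of equal length `k` separated by a gap of `dmin` to `dmax` symbols, an approximate occurrence allows at most `e` mismatches per box (Hamming distance), and a pattern is frequent when it occurs in at least `quorum` input sequences. All function and parameter names are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(seq, box1, box2, dmin, dmax, e=0):
    """True if box1 and then box2 occur in seq, separated by a gap of
    dmin..dmax symbols, each box with at most e mismatches (assumed model)."""
    n1, n2 = len(box1), len(box2)
    for i in range(len(seq) - n1 + 1):
        if hamming(seq[i:i + n1], box1) > e:
            continue
        for d in range(dmin, dmax + 1):
            j = i + n1 + d              # start of the candidate second box
            if j + n2 > len(seq):
                break                   # larger gaps run past the end too
            if hamming(seq[j:j + n2], box2) <= e:
                return True
    return False


def support(seqs, box1, box2, dmin, dmax, e=0):
    """Number of sequences containing at least one occurrence."""
    return sum(occurs(s, box1, box2, dmin, dmax, e) for s in seqs)


def frequent_structured_patterns(seqs, alphabet, k, dmin, dmax, e, quorum):
    """Enumerate every pair of length-k boxes and keep those whose support
    reaches the quorum. Exponential in k; for illustration only."""
    boxes = ["".join(p) for p in product(alphabet, repeat=k)]
    found = {}
    for b1 in boxes:
        for b2 in boxes:
            s = support(seqs, b1, b2, dmin, dmax, e)
            if s >= quorum:
                found[(b1, b2)] = s
    return found
```

Setting `e=0` gives the exact case and `e>0` the approximate one. For example, `frequent_structured_patterns(["AACGTTTGCA", "ACGATGC", "GGGGGGGG"], "ACGT", 3, 1, 3, 0, 2)` enumerates all pairs of 3-symbol boxes over the DNA alphabet. The exhaustive enumeration is chosen only for clarity; practical methods rely on index structures such as suffix trees.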
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Palopoli, L., Terracina, G. (2002). Discovering Frequent Structured Patterns from String Databases: An Application to Biological Sequences. In: Lange, S., Satoh, K., Smith, C.H. (eds) Discovery Science. DS 2002. Lecture Notes in Computer Science, vol 2534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36182-0_6
DOI: https://doi.org/10.1007/3-540-36182-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00188-1
Online ISBN: 978-3-540-36182-4
eBook Packages: Springer Book Archive