Abstract
Given an input sequence of data, a motif is a repeating pattern, possibly interspersed with “don't care” characters; a flexible motif may have a variable (as opposed to fixed) number of “don't care” characters. Given a sequence of records with F fields each, an association rule is a common set of f fields, f ≤ F, with identical (or similar) repeating values. The data in either case could be a sequence of characters, sets of characters, or even real values. It is well known that the number of motifs or association rules, say N, could potentially be exponential in the size of the input sequence or number of records, say n. In this paper we present a new algorithm to discover all flexible motifs or association rules in the input. A novel feature of this algorithm is that its running time is linear in the size of the output (ignoring polylog factors). More precisely, the complexity of the algorithm is O((n^5 + N) log n). This is the first algorithm for motif discovery with a proven output-sensitive complexity bound. The discovery algorithm works in two phases: in the first phase it detects a linear number of core motifs in time polynomial in the input size n, and in the second phase it detects all the remaining N′ motifs in O(N′ log n) time. The core motifs of the first phase are also characterized as being those of “highest specificity”: loosely speaking, a pattern with higher specificity has fewer “don't care” characters. Some applications (for instance, those that require the study of the portions of the input sequence that contribute to the non-gapped regions of motifs) require only the core motifs. Hence for such applications, the first phase of the algorithm suffices. However, the general problem is of use in motif discovery tasks in gene or protein sequences, in the discovery of association rules from gene expression data, and in data mining.
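To make the distinction between rigid and flexible motifs concrete, the following is a minimal sketch that locates occurrences of a given motif in a string. It does not implement the discovery algorithm of the paper; it only illustrates what it means for a pattern with a fixed or variable number of “don't care” characters to occur in a sequence. The token encoding and the function names `motif_to_regex` and `occurrences` are assumptions made for this illustration.

```python
import re


def motif_to_regex(motif):
    """Translate a motif, given as a list of tokens, into a regular expression.

    Token encoding (hypothetical, chosen for this sketch only):
      - a one-character string: a solid character
      - an int k: exactly k don't-care positions (rigid gap)
      - a tuple (lo, hi): between lo and hi don't-care positions (flexible gap)
    """
    parts = []
    for tok in motif:
        if isinstance(tok, str):
            parts.append(re.escape(tok))
        elif isinstance(tok, int):
            parts.append(".{%d}" % tok)
        else:
            lo, hi = tok
            parts.append(".{%d,%d}" % (lo, hi))
    return "".join(parts)


def occurrences(motif, text):
    """Return every start position at which the motif occurs in text.

    A zero-width lookahead is used so that overlapping occurrences
    are all reported.
    """
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")", re.DOTALL)
    return [m.start() for m in pattern.finditer(text)]


text = "ABCAXBCAXYBC"
rigid = ["A", 1, "C"]          # A, exactly one don't care, then C
flexible = ["A", (1, 3), "C"]  # A, one to three don't cares, then C
print(occurrences(rigid, text))
print(occurrences(flexible, text))
```

On this input the rigid motif occurs only at position 0, while the flexible motif occurs at positions 0, 3, and 7, since its gap can stretch over one, two, or three characters.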
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Parida, L., Rigoutsos, I., Platt, D. (2001). An Output-Sensitive Flexible Pattern Discovery Algorithm. In: Amir, A. (eds) Combinatorial Pattern Matching. CPM 2001. Lecture Notes in Computer Science, vol 2089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48194-X_11
DOI: https://doi.org/10.1007/3-540-48194-X_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42271-6
Online ISBN: 978-3-540-48194-2
eBook Packages: Springer Book Archive