An Output-Sensitive Flexible Pattern Discovery Algorithm

Parida, Laxmi; Rigoutsos, Isidore; Platt, Dan

doi:10.1007/3-540-48194-X_11

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2089)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

Given an input sequence of data, a motif is a repeating pattern, possibly interspersed with “don't care” characters, and a flexible motif may have a variable (as opposed to fixed) number of “don't care” characters. Given a sequence of records with F fields each, an association rule is a common set of f fields, f ≤ F, with identical (or similar) repeating values. The data in either case could be a sequence of characters, sets of characters, or even real values. It is well known that the number of motifs or association rules, say N, could potentially be exponential in the size of the input sequence or number of records, say n. In this paper we present a new algorithm to discover all flexible motifs or association rules in the input. A novel feature of this algorithm is that its running time is linear in the size of the output (ignoring polylog factors). More precisely, the complexity of the algorithm is O((n⁵ + N) log n). This is the first algorithm for motif discovery with a proven output-sensitive complexity bound. The discovery algorithm works in two phases: in the first phase it detects a linear number of core motifs in time polynomial in the input size n, and in the second phase it detects all the remaining motifs, N′ in number, in O(N′ log n) time. The core motifs of the first phase are also characterized as being those of “highest specificity”: loosely speaking, a pattern with higher specificity has fewer “don't care” characters. Some applications (for instance, those that require the study of the portions of the input sequence that contribute to the non-gapped regions of motifs) require only the core motifs. Hence, for such applications, the first phase of the algorithm suffices. However, the general problem is of use in motif discovery tasks in gene or protein sequences, in the discovery of association rules from gene expression data, and in data mining.
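To make the distinction between a rigid and a flexible motif concrete, the following is a minimal sketch that checks where a given motif occurs in a sequence. It is a brute-force illustration of the definitions only, not the paper's discovery algorithm. The encoding of a motif as a list of solid characters and `(lo, hi)` gap ranges, and the names `matches_at` and `find_occurrences`, are assumptions introduced here.

```python
from typing import List, Tuple, Union

# A motif element is either a solid character (str) or a gap (lo, hi),
# meaning between lo and hi "don't care" positions, inclusive.
# This encoding is an illustrative assumption, not the paper's notation.
Element = Union[str, Tuple[int, int]]


def matches_at(seq: str, motif: List[Element], start: int) -> bool:
    """Return True if some realization of the motif occurs at seq[start:]."""
    # Sequence positions reachable after consuming the elements seen so far.
    positions = {start}
    for elem in motif:
        reachable = set()
        for p in positions:
            if isinstance(elem, str):
                # Solid character: must match exactly.
                if p < len(seq) and seq[p] == elem:
                    reachable.add(p + 1)
            else:
                lo, hi = elem
                # Gap: skip any number of characters between lo and hi.
                for k in range(lo, hi + 1):
                    if p + k <= len(seq):
                        reachable.add(p + k)
        if not reachable:
            return False
        positions = reachable
    return True


def find_occurrences(seq: str, motif: List[Element]) -> List[int]:
    """Start positions where the motif occurs (brute force, for illustration)."""
    return [i for i in range(len(seq)) if matches_at(seq, motif, i)]


seq = "ABCAXXCAC"
rigid = ["A", (1, 1), "C"]     # exactly one don't-care between A and C
flexible = ["A", (1, 3), "C"]  # one to three don't-cares between A and C

print(find_occurrences(seq, rigid))     # [0]
print(find_occurrences(seq, flexible))  # [0, 3]
```

In this toy input the rigid pattern occurs only once, so it does not repeat, while the flexible version occurs at two positions and therefore qualifies as a repeating pattern in the sense of the abstract.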

Author information

Authors and Affiliations

IBM Thomas J. Watson Research Center, Yorktown Heights, NY, 10598, USA
Laxmi Parida

Editor information

Editors and Affiliations

Department of Computer Science, Bar-Ilan University, 52900, Ramat-Gan, Israel, Atlanta, Georgia, 30332-0280, USA
Amihood Amir

About this paper

Cite this paper

Parida, L., Rigoutsos, I., Platt, D. (2001). An Output-Sensitive Flexible Pattern Discovery Algorithm. In: Amir, A. (eds) Combinatorial Pattern Matching. CPM 2001. Lecture Notes in Computer Science, vol 2089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48194-X_11

DOI: https://doi.org/10.1007/3-540-48194-X_11
Published: 13 June 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42271-6
Online ISBN: 978-3-540-48194-2
eBook Packages: Springer Book Archive
