An Output-Sensitive Flexible Pattern Discovery Algorithm

  • Conference paper
Combinatorial Pattern Matching (CPM 2001)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2089))

Abstract

Given an input sequence of data, a motif is a repeating pattern, possibly interspersed with “don’t care” characters, and a flexible motif is one that may have a variable (as opposed to fixed) number of “don’t care” characters. Given a sequence of records with F fields each, an association rule is a common set of f fields, f ≤ F, with identical (or similar) repeating values. The data in either case could be a sequence of characters, sets of characters, or even real values. It is well known that the number of motifs or association rules, say N, could potentially be exponential in the size of the input sequence or number of records, say n. In this paper we present a new algorithm to discover all flexible motifs or association rules in the input. A novel feature of this algorithm is that its running time is linear in the size of the output (ignoring polylog factors). More precisely, the complexity of the algorithm is O((n^5 + N) log n). This is the first algorithm for motif discovery with a proven output-sensitive complexity bound. The discovery algorithm works in two phases: in the first phase it detects a linear number of core motifs in time polynomial in the input size n, and in the second phase it detects all the remaining motifs, N′ in number, in O(N′ log n) time. The core motifs of the first phase are also characterized as being those of “highest specificity”: loosely speaking, a pattern with higher specificity has fewer “don’t care” characters. Some applications (for instance, those that study the portions of the input sequence that contribute to the non-gapped regions of motifs) require only the core motifs; for such applications, the first phase of the algorithm suffices. The general problem, however, is of use in motif discovery tasks in gene or protein sequences, in the discovery of association rules from gene expression data, and in data mining.
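To make the notion of a flexible motif concrete, the following is a minimal Python sketch that lists the occurrences of a given motif in a character sequence. It is only an illustration of the definition, not the discovery algorithm of the paper. The motif representation (a list of solid characters and `(lo, hi)` gap ranges) and the helper names `motif_to_regex` and `occurrences` are assumptions made for this example.

```python
import re

def motif_to_regex(motif):
    """Translate a motif into a regular expression string.

    `motif` is a list whose items are either a solid character (str)
    or a (lo, hi) tuple meaning "between lo and hi don't-care characters".
    This representation is an assumption of the sketch.
    """
    parts = []
    for item in motif:
        if isinstance(item, tuple):
            lo, hi = item
            parts.append(".{%d,%d}" % (lo, hi))   # run of don't-care characters
        else:
            parts.append(re.escape(item))          # solid character
    return "".join(parts)

def occurrences(motif, sequence):
    """Return every start position at which the motif matches."""
    # A lookahead makes each match zero-width, so overlapping
    # occurrences are reported as well.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence)]

sequence = "ABXCDABYYCD"
rigid = ["A", "B", (1, 1), "C"]      # exactly one don't-care character
flexible = ["A", "B", (1, 2), "C"]   # one or two don't-care characters

print(occurrences(rigid, sequence))     # [0]
print(occurrences(flexible, sequence))  # [0, 5]
```

In this toy input the rigid pattern occurs only once and so does not repeat, whereas allowing a variable gap lets the same solid characters match at two positions, which is what makes the flexible pattern a motif here.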

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Parida, L., Rigoutsos, I., Platt, D. (2001). An Output-Sensitive Flexible Pattern Discovery Algorithm. In: Amir, A. (eds) Combinatorial Pattern Matching. CPM 2001. Lecture Notes in Computer Science, vol 2089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48194-X_11

  • DOI: https://doi.org/10.1007/3-540-48194-X_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42271-6

  • Online ISBN: 978-3-540-48194-2

  • eBook Packages: Springer Book Archive
