
Abstract

Today, a huge number of DNA and protein sequences is available, but the growth of biological knowledge has not kept pace with the growth of the data. Hence, much more effective computational methods are required to reveal the inherent functional units in these sequences, in the form of patterns that can be related back to biology and to life science applications. Pattern discovery provides such a computational tool. This chapter briefly reviews known pattern discovery techniques for sequence data and then presents a new pattern discovery framework capable of discovering statistically significant sequence patterns effectively without relying on any prior domain knowledge. In response to the “too many patterns” problem, our algorithm removes redundant patterns, including those attributable to their strongly statistically significant sub-patterns. It thus yields a compact set of quality patterns, making interpretation and further model development much easier. When applied to transcription factor binding site data, it obtains a relatively small set of patterns: 14 out of 18 consensus binding sites are associated with our top-ranking patterns.
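As a rough illustration of what a "statistically significant sequence pattern" can mean, the sketch below counts every fixed-length substring (k-mer) in a set of sequences and keeps those that occur far more often than an independent-residue background would predict. It is not the chapter's algorithm: the function name `significant_kmers`, the parameters `k`, `threshold`, and `min_count`, the i.i.d. background model, and the standardized-residual test are all assumptions chosen for brevity, and no redundancy pruning is attempted.

```python
from collections import Counter
from math import sqrt


def significant_kmers(sequences, k=3, threshold=3.0, min_count=2):
    """Return {kmer: z} for k-mers over-represented against an i.i.d. background.

    Hypothetical helper for illustration only; not the chapter's method.
    """
    # Background model: single-symbol frequencies pooled over all sequences.
    symbol_counts = Counter(ch for seq in sequences for ch in seq)
    total_symbols = sum(symbol_counts.values())
    freq = {ch: n / total_symbols for ch, n in symbol_counts.items()}

    # Observed k-mer counts and the total number of windows examined.
    observed = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    windows = sum(observed.values())

    result = {}
    for kmer, count in observed.items():
        # Tiny expected counts inflate z, so also require a minimum support.
        if count < min_count:
            continue
        # Probability of the k-mer if symbols were drawn independently.
        p = 1.0
        for ch in kmer:
            p *= freq[ch]
        if p >= 1.0:
            continue  # degenerate single-symbol alphabet; variance would be zero
        expected = windows * p
        # Standardized residual under a binomial approximation.
        z = (count - expected) / sqrt(expected * (1.0 - p))
        if z >= threshold:
            result[kmer] = z
    return result
```

Calling `significant_kmers(["TATAACGT", "GCTATAAG", "CATATAAC"], k=5, min_count=3)` keeps only the 5-mer shared by all three toy sequences. A real framework would also have to handle variable-length patterns, gaps, and the removal of patterns that are significant only because a sub-pattern is.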

Copyright information

© 2011 Higher Education Press, Beijing and Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Wong, A.K.C., Zhuang, D., Li, G.C.L., Lee, ES.A. (2011). Pattern Discovery and Recognition in Sequences. In: Wang, P.S.P. (eds) Pattern Recognition, Machine Intelligence and Biometrics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22407-2_2

  • DOI: https://doi.org/10.1007/978-3-642-22407-2_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22406-5

  • Online ISBN: 978-3-642-22407-2

  • eBook Packages: Computer Science, Computer Science (R0)
