Pattern Discovery and Recognition in Sequences

Wong, Andrew K. C.; Zhuang, Dennis; Li, Gary C. L.; Lee, En-Shiun Annie

doi:10.1007/978-3-642-22407-2_2

Abstract

Today, a huge number of DNA and protein sequences are available, but the growth of biological knowledge has not kept pace with the growth of the data. Much more effective computational methods are therefore required to reveal the inherent functional units in these sequences, in the form of patterns that can be related back to the biological world and to life science applications. Pattern discovery provides such a computational tool. This chapter briefly reviews known pattern discovery techniques for sequence data and then presents a new pattern discovery framework that discovers statistically significant sequence patterns effectively without relying on any prior domain knowledge. In response to the “too many patterns” problem, our algorithm removes redundant patterns, including those attributable to their strongly statistically significant sub-patterns. It thus yields a compact set of quality patterns, making interpretation and further model development much easier. When applied to transcription factor binding site data, it obtains a relatively small set of patterns: 14 out of 18 consensus binding sites are associated with our top-ranking patterns.
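
To make the two ideas in the abstract concrete (scoring patterns for statistical significance, then pruning redundant ones), here is a minimal Python sketch. It is not the chapter's algorithm: it assumes an i.i.d. background model built from single-letter frequencies, scores each k-mer with a standardized residual, and uses a simple substring-with-equal-count rule as a stand-in for redundancy removal. The function names, the thresholds, and the toy sequences are all hypothetical.

```python
from collections import Counter
from math import sqrt

def significant_kmers(seqs, k, z_threshold=3.0, min_count=2):
    """Return {kmer: (count, z)} for k-mers overrepresented vs. an i.i.d. background.

    Assumption: letters are independent, with frequencies estimated from seqs.
    """
    letters = Counter("".join(seqs))
    total = sum(letters.values())
    freq = {a: n / total for a, n in letters.items()}

    # Count every k-mer occurrence and the number of windows examined.
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    windows = sum(counts.values())

    result = {}
    for kmer, obs in counts.items():
        p = 1.0
        for a in kmer:
            p *= freq[a]
        expected = windows * p
        z = (obs - expected) / sqrt(expected)  # standardized residual
        if obs >= min_count and z >= z_threshold:
            result[kmer] = (obs, z)
    return result

def prune_subsumed(patterns):
    """Drop a pattern if a longer pattern contains it and has the same count.

    This is a simplified redundancy rule chosen for illustration only.
    """
    kept = {}
    for p, (count, z) in patterns.items():
        subsumed = any(
            len(q) > len(p) and p in q and q_count == count
            for q, (q_count, _) in patterns.items()
        )
        if not subsumed:
            kept[p] = (count, z)
    return kept

# Toy data: the motif TATAAT planted once in each of four G/C-rich sequences.
seqs = [
    "GCGCTATAATGCGC",
    "CCGGTATAATCGCG",
    "GGCCTATAATGGCC",
    "CGCGTATAATCCGG",
]
found = {}
for k in (4, 5, 6):
    found.update(significant_kmers(seqs, k, min_count=3))
compact = prune_subsumed(found)
```

On this toy input, `found` holds the planted motif together with every fragment of it, a small instance of the "too many patterns" problem, while `compact` keeps only the longest one. With expected counts this small the residual is a crude score, so a real analysis would need a more careful significance test.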

Author information

Authors and Affiliations

PAMI Group, Department of System Design, University of Waterloo, Canada
En-Shiun Annie Lee

Editor information

Editors and Affiliations

CCIS, Northeastern University, 360 Huntington Ave, Boston, MA, 02110, USA
Patrick S. P. Wang

About this chapter

Cite this chapter

Wong, A.K.C., Zhuang, D., Li, G.C.L., Lee, ES.A. (2011). Pattern Discovery and Recognition in Sequences. In: Wang, P.S.P. (eds) Pattern Recognition, Machine Intelligence and Biometrics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22407-2_2

DOI: https://doi.org/10.1007/978-3-642-22407-2_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22406-5
Online ISBN: 978-3-642-22407-2
eBook Packages: Computer Science, Computer Science (R0)
