Abstract
Today, a huge number of DNA and protein sequences is available, but the growth of biological knowledge has not kept pace with the growth of the data. Much more effective computational methods are therefore required to reveal the inherent functional units in these sequences, in the form of patterns that can be related back to the biological world and to life science applications. Pattern discovery provides such a computational tool. This chapter briefly reviews known pattern discovery techniques for sequence data and then presents a new pattern discovery framework that discovers statistically significant sequence patterns effectively without relying on any prior domain knowledge. In response to the “too many patterns” problem, our algorithm removes redundant patterns, including those attributable to their strongly statistically significant sub-patterns. It thus yields a compact set of high-quality patterns, which makes interpretation and further model development much easier. When applied to transcription factor binding site data, it obtains a relatively small set of patterns: 14 of the 18 consensus binding sites are associated with our top-ranking patterns.
Copyright information
© 2011 Higher Education Press, Beijing and Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Wong, A.K.C., Zhuang, D., Li, G.C.L., Lee, ES.A. (2011). Pattern Discovery and Recognition in Sequences. In: Wang, P.S.P. (eds) Pattern Recognition, Machine Intelligence and Biometrics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22407-2_2
DOI: https://doi.org/10.1007/978-3-642-22407-2_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22406-5
Online ISBN: 978-3-642-22407-2
eBook Packages: Computer Science (R0)