Discovery of Non-induced Patterns from Sequences

  • Andrew K. C. Wong
  • Dennis Zhuang
  • Gary C. L. Li
  • En-Shiun Annie Lee
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6282)

Abstract

Discovering patterns in sequence data has significant impact in genomics, proteomics and business. A commonly encountered problem is that the discovered patterns often contain many redundancies: patterns that appear significant only because they are induced by strongly statistically significant subpatterns. We propose the concept of statistically induced patterns to capture these redundancies, and develop an algorithm that efficiently discovers non-induced significant patterns in a large sequence dataset. For performance evaluation, two experiments were conducted to demonstrate (a) the seriousness of the problem, using synthetic data, and (b) that the top non-induced significant patterns discovered from Saccharomyces cerevisiae (yeast) correspond to transcription factor binding sites identified by biologists. The experiments confirm the effectiveness of our method in generating a relatively small set of patterns that reveals interesting, previously unknown information inherent in the sequences.
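
The idea of an induced pattern can be illustrated with a small sketch. This is not the algorithm from the paper, whose details are not given in the abstract; it is a minimal illustration under stated assumptions: an i.i.d. background model over letters, a binomial standardized residual as the significance score, a cutoff of 3.0, and a plain substring counter in place of a suffix tree. All function names and the example counts are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_substrings(sequences, max_len):
    """Count every substring of length 1..max_len (overlapping occurrences)."""
    counts = Counter()
    for seq in sequences:
        for k in range(1, max_len + 1):
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return counts


def z_score(observed, expected, trials):
    """Standardized residual of a count under a binomial null model."""
    p = expected / trials
    variance = trials * p * (1.0 - p)
    if variance <= 0:
        return 0.0
    return (observed - expected) / sqrt(variance)


def null_expected(pattern, letter_prob, total_positions):
    """Expected count if letters were independent (assumed background model).

    Simplification: the same number of positions is used for every length.
    """
    expected = float(total_positions)
    for ch in pattern:
        expected *= letter_prob[ch]
    return expected


def is_induced(pattern, counts, letter_prob, total_positions, threshold=3.0):
    """Illustrative test for patterns of length >= 2.

    Returns True when the pattern is significant on its own, yet its count is
    explained by a significant subpattern one letter shorter plus a random
    extra letter.
    """
    observed = counts.get(pattern, 0)
    z_alone = z_score(observed,
                      null_expected(pattern, letter_prob, total_positions),
                      total_positions)
    if z_alone < threshold:
        return False  # not significant at all, so nothing to explain away
    # Try the prefix and the suffix obtained by dropping one letter.
    for sub, extra in ((pattern[:-1], pattern[-1]), (pattern[1:], pattern[0])):
        sub_count = counts.get(sub, 0)
        z_sub = z_score(sub_count,
                        null_expected(sub, letter_prob, total_positions),
                        total_positions)
        if z_sub < threshold:
            continue  # a non-significant subpattern cannot induce anything
        # Expected count of the pattern given how often the subpattern occurs.
        expected_given_sub = sub_count * letter_prob[extra]
        if z_score(observed, expected_given_sub, sub_count) < threshold:
            return True
    return False


# Made-up counts for illustration only (not data from the paper).
letter_prob = {ch: 0.25 for ch in "ACGT"}
example_counts = {"ACGT": 200, "CGTA": 60, "CGTT": 160,
                  "ACGTA": 52, "ACGTT": 150}
total_positions = 10_000
```

With these made-up counts, `is_induced("ACGTA", example_counts, letter_prob, total_positions)` flags ACGTA as induced: it occurs far more often than the background predicts, but only about as often as the over-represented ACGT followed by a random letter. ACGTT is not flagged, because it remains over-represented even after conditioning on ACGT or CGTT.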

Keywords

Sequence Pattern Discovery; Statistically Induced Patterns; Suffix Tree

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Andrew K. C. Wong (1)
  • Dennis Zhuang (1)
  • Gary C. L. Li (1)
  • En-Shiun Annie Lee (1)
  1. Department of System Design, University of Waterloo, Waterloo, Canada
