Short Segment Frequency Equalization: A Simple and Effective Alternative Treatment of Background Models in Motif Discovery

  • Kazuhito Shida
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5780)


One of the most important pattern recognition problems in bioinformatics is the de novo motif discovery. In particular, there is a large room of improvement in motif discovery from eukaryotic genome, where the sequences have complicated background noise. The short segment frequency equalization (SSFE) is a novel treatment method to incorporate Markov background models into de novo motif discovery algorithms, namely Gibbs sampling. Despite its apparent simplicity, SSFE shows a large performance improvement over the current method (Q/P scheme) when tested on artificial DNA datasets with Markov background of human and mouse. Furthermore, SSFE shows a better performance than other methods including much more complicated and sophisticated method, Weeder 1.3, when tested with several biological datasets from human promoters.


Motif discovery Markov background model Eukaryotic promoters Stochastic method Gibbs sampling 


  1. 1.
    Reddy, T.E., DeLisi, C., Shakhnovich, B.E.: Binding site graphs: A new graph theoretical framework for prediction of transcription factor binding sites. Plos Computational Biology 3, 844–854 (2007)CrossRefGoogle Scholar
  2. 2.
    Mahony, S., Hendrix, D., Golden, A., Smith, T.J., Rokhsar, D.S.: Transcription factor binding site identification using the self-organizing map. Bioinformatics 21, 1807–1814 (2005)CrossRefPubMedGoogle Scholar
  3. 3.
    Liu, X.S., Brutlag, D.L., Liu, J.S.: An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 20, 835–839 (2002)CrossRefPubMedGoogle Scholar
  4. 4.
    Pevzner, P.A., Sze, S.H.: Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 269–278 (2000)PubMedGoogle Scholar
  5. 5.
    Rigoutsos, I., Floratos, A.: Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14, 55–67 (1998)CrossRefPubMedGoogle Scholar
  6. 6.
    Sinha, S., Tompa, M.: YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research 31, 3586–3588 (2003)CrossRefPubMedPubMedCentralGoogle Scholar
  7. 7.
    Pavesi, G., Zambelli, F., Pesole, G.: WeederH: an algorithm for finding conserved regulatory motifs and regions in homologous sequences. BMC Bioinformatics 8 (2007)Google Scholar
  8. 8.
    Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., Regnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., Zhu, Z.: Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144 (2005)CrossRefPubMedGoogle Scholar
  9. 9.
    Csuros, M., Noe, L., Kucherov, G.: Reconsidering the significance of genomic word frequencies. Trends in Genetics 23, 543–546 (2007)CrossRefPubMedGoogle Scholar
  10. 10.
    Neuwald, A.F., Liu, J.S., Lawrence, C.E.: Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci. 4, 1618–1632 (1995)CrossRefPubMedPubMedCentralGoogle Scholar
  11. 11.
    Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C.: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208–214 (1993)CrossRefPubMedGoogle Scholar
  12. 12.
    Frith, M.C., Hansen, U., Spouge, J.L., Weng, Z.: Finding functional sequence elements by multiple local alignment. Nucleic Acids Res. 32, 189–200 (2004)CrossRefPubMedPubMedCentralGoogle Scholar
  13. 13.
    Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994)PubMedGoogle Scholar
  14. 14.
    Messer, P.W., Bundschuh, R., Vingron, M., Arndt, P.F.: Effects of long-range correlations in DNA on sequence alignment score statistics. Journal of Computational Biology 14, 655–668 (2007)CrossRefPubMedGoogle Scholar
  15. 15.
    Herzel, H., Trifonov, E.N., Weiss, O., Grosse, I.: Interpreting correlations in biosequences. Physica A 249, 449–459 (1998)CrossRefGoogle Scholar
  16. 16.
    Fitch, W.M.: Random Sequences. Journal of Molecular Biology 163, 171–176 (1983)CrossRefPubMedGoogle Scholar
  17. 17.
    Thijs, G., Lescot, M., Marchal, K., Rombauts, S., De Moor, B., Rouze, P., Moreau, Y.: A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17, 1113–1122 (2001)CrossRefPubMedGoogle Scholar
  18. 18.
    Narasimhan, C., LoCascio, P., Uberbacher, E.: Background rareness-based iterative multiple sequence alignment algorithm for regulatory element detection. Bioinformatics 19, 1952–1963 (2003)CrossRefPubMedGoogle Scholar
  19. 19.
    Shida, K.: GibbsST: a Gibbs sampling method for motif discovery with enhanced resistance to local optima. BMC Bioinformatics 7 (2006)Google Scholar
  20. 20.
    Blanco, E., Farre, D., Alba, M.M., Messeguer, X., Guigo, R.: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Res. 34, D63–D67 (2006)CrossRefGoogle Scholar
  21. 21.
    Pavesi, G., Mereghetti, P., Mauri, G., Pesole, G.: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research 32, W199–W203 (2004)CrossRefGoogle Scholar
  22. 22.
    van Helden, J.: The analysis of regulatory sequences. In: Chatenay, D., Cocco, S., Monasson, R., Thieffry, D., Dailbard, J. (eds.) Multiple aspects of DNA and RNA from biophysics to bioinformatics, pp. 271–304. Elsevier, Amsterdam (2005)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Kazuhito Shida
    • 1
  1. 1.Institute for Material ResearchSendaiJapan

Personalised recommendations