Simultaneously Learning DNA Motif along with Its Position and Sequence Rank Preferences through EM Algorithm

  • ZhiZhuo Zhang
  • Cheng Wei Chang
  • Willy Hugo
  • Edwin Cheung
  • Wing-Kin Sung
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7262)


Although de novo motifs can be discovered by mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. One way to improve accuracy is to consider additional binding features (i.e. position preference and sequence rank preference), but this information usually has to be supplied by the user. This paper presents a de novo motif discovery algorithm called SEME, which uses a pure probabilistic mixture model to model the motif’s binding features and uses the expectation maximization (EM) algorithm to simultaneously learn the sequence motif and its position and sequence rank preferences without asking the user for any prior knowledge. SEME is both efficient and accurate thanks to two important techniques: variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets and 164 ChIP-Seq libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to the more difficult problem of finding co-regulated TF (co-TF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct co-TF motifs and, at the same time, predicted co-TF motifs that better match the known motifs. Finally, we show that the learned position and sequence rank preferences of each co-TF reveal potential interaction mechanisms between the primary TF and the co-TF within these sites. Some of these findings were further validated by ChIP-Seq experiments on the co-TFs.
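The idea of learning a motif and its position preference in one EM loop can be illustrated with a small sketch. This is not the SEME implementation: it is a minimal two-component mixture over fixed-width windows, written under several assumptions that do not come from the paper (a fixed motif width, a uniform background, a position preference stored as a histogram over `n_bins` bins, and a start from a known seed k-mer). The names `em_motif`, `make_synthetic`, `windows` and `consensus` are hypothetical. Sequence rank preference, variable-length extension and importance sampling are left out; a rank preference could plausibly be added as a second binned variable handled the same way as position.

```python
import random

BASES = "ACGT"

def windows(seqs, w, n_bins):
    """List (kmer, position_bin) for every width-w window of every sequence."""
    out = []
    for s in seqs:
        n = len(s) - w + 1
        for p in range(n):
            out.append((s[p:p + w], p * n_bins // n))
    return out

def em_motif(seqs, seed_kmer, n_bins=5, iters=30, pseudo=0.1):
    """Toy EM: jointly fit a PWM, a binned position preference and a mixing weight.

    Assumption: each window is drawn either from the motif component
    (PWM times position preference) or from a uniform background.
    """
    w = len(seed_kmer)
    data = windows(seqs, w, n_bins)
    pwm = [{c: (0.7 if c == seed_kmer[j] else 0.1) for c in BASES} for j in range(w)]
    pos = [1.0 / n_bins] * n_bins   # position preference, starts uniform
    lam = 0.01                      # prior probability that a window is a site
    for _ in range(iters):
        # E-step: posterior probability that each window is a motif site.
        z = []
        for kmer, b in data:
            pm = lam * pos[b]
            for j, c in enumerate(kmer):
                pm *= pwm[j][c]
            pb = (1.0 - lam) * (0.25 ** w) / n_bins
            z.append(pm / (pm + pb))
        # M-step: re-estimate all parameters from the weighted counts.
        total = sum(z)
        counts = [{c: pseudo for c in BASES} for _ in range(w)]
        pos_counts = [pseudo] * n_bins
        for (kmer, b), zi in zip(data, z):
            for j, c in enumerate(kmer):
                counts[j][c] += zi
            pos_counts[b] += zi
        pwm = [{c: col[c] / (total + 4 * pseudo) for c in BASES} for col in counts]
        pos = [x / (total + n_bins * pseudo) for x in pos_counts]
        lam = total / len(data)
    return pwm, pos, lam

def consensus(pwm):
    """Most probable base at each motif column."""
    return "".join(max(BASES, key=col.get) for col in pwm)

def make_synthetic(n_seqs=200, length=60, motif="ACGTAC", frac=0.8, seed=0):
    """Random sequences with the motif planted near the centre of a fraction of them."""
    rng = random.Random(seed)
    seqs = []
    for _ in range(n_seqs):
        s = [rng.choice(BASES) for _ in range(length)]
        if rng.random() < frac:
            p = length // 2 - len(motif) // 2 + rng.randint(-2, 2)
            s[p:p + len(motif)] = motif
        seqs.append("".join(s))
    return seqs

if __name__ == "__main__":
    seqs = make_synthetic()
    pwm, pos, lam = em_motif(seqs, seed_kmer="ACGTAC")
    print(consensus(pwm), [round(x, 3) for x in pos], round(lam, 4))
```

On synthetic data where the motif is planted near the sequence centre, the learned `pos` histogram should concentrate on the middle bin, which is the kind of position preference the abstract describes. In practice the seed would come from an over-represented k-mer rather than from the known answer, and the input is assumed to contain only the characters A, C, G and T.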


Keywords: Motif Finding; Expectation Maximization; Importance Sampling; Binding Preference





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • ZhiZhuo Zhang (1)
  • Cheng Wei Chang (2)
  • Willy Hugo (1)
  • Edwin Cheung (2)
  • Wing-Kin Sung (1, 2)
  1. National University of Singapore, Singapore
  2. Genome Institute of Singapore, Singapore
