Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability

  • Bin Fu
  • Ming-Yang Kao
  • Lusheng Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5532)


We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet Σ. A motif \(G = g_1 g_2 \ldots g_m\) is a string of m characters. A probabilistically generated approximate copy of G is implanted into each background sequence. In such a copy \(b_1 b_2 \ldots b_m\) of G, every character is probabilistically generated such that the probability that \(b_i \neq g_i\) is at most α. It has been conjectured that multiple background sequences can help with finding faint motifs G.
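The model above can be simulated with a short sketch. The abstract does not specify the details, so the following are assumptions made for illustration only: background characters are drawn uniformly from Σ, each motif character is replaced independently with probability exactly α (the model only requires at most α), the replacement is uniform over the other characters, and the copy overwrites a uniformly chosen window of the background. The function name `implant_motif` and its parameters are hypothetical.

```python
import random


def implant_motif(motif, k, n, alpha, alphabet="ACGT", seed=0):
    """Generate k background sequences of length n, each with a noisy copy of motif.

    Assumptions (not stated in the abstract): uniform background, independent
    per-character mutation with probability exactly alpha, uniform implant position.
    Returns the sequences and the start position of each implanted copy.
    """
    rng = random.Random(seed)
    m = len(motif)
    if m > n:
        raise ValueError("motif must not be longer than the background sequence")
    sequences, positions = [], []
    for _ in range(k):
        # Random background sequence over the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]
        # Approximate copy b_1 ... b_m of the motif g_1 ... g_m.
        copy = []
        for g in motif:
            if rng.random() < alpha:
                # Mutate: pick a character different from g.
                copy.append(rng.choice([c for c in alphabet if c != g]))
            else:
                copy.append(g)
        # Overwrite a random window of the background with the copy.
        start = rng.randrange(n - m + 1)
        background[start:start + m] = copy
        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions
```

With `alpha=0.0` every implanted copy equals the motif exactly, which gives a quick sanity check of the generator before trying noisier settings.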

In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet Σ with |Σ| ≥ 2 and is applicable to DNA motif discovery. We prove that for \(\alpha<{1\over 4}(1-{1\over |\Sigma|})\) and any constant x ≥ 8, there exist positive constants \(c_0\), ε, \(\delta_1\) and \(\delta_2\) such that if the length ρ of motif G is at least \(\delta_1 \log n\), and there are \(k \ge c_0 \log n\) input sequences, then in \(O(n^2 + kn)\) time this algorithm finds the motif with probability at least \(1-{1\over 2^x}\) for every \(G\in \Sigma^{\rho}-\Psi_{\rho, h,\epsilon}(\Sigma)\), where h is a parameter with \(\rho \ge 4h \ge \delta_2 \log n\), and \(\Psi_{\rho, h,\epsilon}(\Sigma)\) is a small subset containing at most a \(2^{-\Theta(\epsilon^2 h)}\) fraction of the sequences in \(\Sigma^{\rho}\). The constants \(c_0\), ε, \(\delta_1\) and \(\delta_2\) do not depend on x when x is a parameter of order \(O(\log n)\). Our algorithm can take any number k of sequences as input.
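To make the two explicit bounds in this statement concrete, the sketch below evaluates the mutation-rate threshold \({1\over 4}(1-{1\over |\Sigma|})\) and the success-probability lower bound \(1-{1\over 2^x}\) with exact fractions. The function names are hypothetical, and x is assumed to be an integer here purely for convenience. The sketch does not implement the paper's algorithm.

```python
from fractions import Fraction


def alpha_threshold(alphabet_size):
    """Upper limit on alpha from the abstract: (1/4) * (1 - 1/|Sigma|)."""
    if alphabet_size < 2:
        raise ValueError("the result is stated for |Sigma| >= 2")
    return Fraction(1, 4) * (1 - Fraction(1, alphabet_size))


def success_lower_bound(x):
    """Lower bound on the success probability: 1 - 1/2^x (x assumed integer)."""
    return 1 - Fraction(1, 2 ** x)


# DNA alphabet, |Sigma| = 4: alpha must be below 3/16.
print(alpha_threshold(4))        # 3/16
# Smallest constant allowed by the statement, x = 8.
print(success_lower_bound(8))    # 255/256
```

For DNA this means each motif character may be mutated with probability below 3/16, and with x = 8 the stated success probability is at least 255/256.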





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Bin Fu (1)
  • Ming-Yang Kao (2)
  • Lusheng Wang (3)
  1. Dept. of Computer Science, University of Texas – Pan American, USA
  2. Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, USA
  3. Department of Computer Science, The City University of Hong Kong, Kowloon, Hong Kong
