An Upper Bound on the Hardness of Exact Matrix Based Motif Discovery

  • Paul Horton
  • Wataru Fujibuchi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3537)

Abstract

Motif discovery is the problem of finding local patterns, or motifs, in a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix-based motif discovery has been extensively studied, but no positive results regarding its theoretical hardness were previously known. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width w (which is typically 5–20) and is a dramatic improvement relative to previously known bounds.
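As an illustration of the score matrix representation mentioned above, the following is a minimal sketch, not the authors' method. It assumes a DNA alphabet, a uniform background, log-odds scores with a pseudocount, and a hypothetical set of already-aligned sites. The names `log_odds_matrix`, `score`, and `best_window` are invented for this example. It only scores words against a given matrix and does not perform motif discovery.

```python
import math

ALPHABET = "ACGT"  # assumed DNA alphabet (sigma = 4)


def log_odds_matrix(sites, background=0.25, pseudocount=1.0):
    """Build a width-w log-odds score matrix from equal-length aligned sites.

    Each column maps a character to log2(p(char at this column) / background).
    The uniform background and the pseudocount are assumptions of this sketch.
    """
    w = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for j in range(w):
        column = [site[j] for site in sites]
        matrix.append({
            c: math.log2(((column.count(c) + pseudocount) / total) / background)
            for c in ALPHABET
        })
    return matrix


def score(matrix, word):
    """Score a word of length w as the sum of its per-column scores."""
    return sum(col[c] for col, c in zip(matrix, word))


def best_window(matrix, sequence):
    """Return the start index of the highest-scoring length-w window."""
    w = len(matrix)
    return max(
        range(len(sequence) - w + 1),
        key=lambda i: score(matrix, sequence[i:i + w]),
    )


# Hypothetical aligned sites, for illustration only.
sites = ["TTACCC", "TTACCC", "TTACCG"]
matrix = log_odds_matrix(sites)
start = best_window(matrix, "GGGTTACCCGGG")
```

Because each column contributes independently to the score, the matrix induces a ranking over all σ^w strings of length w.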

We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length w, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of σ (typically 4) the trivial bound is \(n! \approx ({\frac{n}{e}})^n, n={\sigma}^w\). Our bound is roughly \(n(\sigma \log_{\sigma} n)^n\).
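To get a feel for the gap between the two bounds, the sketch below compares them on a log scale. It assumes the stated bound reads as n(σ log_σ n)^n with n = σ^w, so that log_σ n = w. The function names are invented for this example. Both quantities are far too large to evaluate directly, so the code works with base-10 logarithms.

```python
import math


def log10_trivial_bound(sigma, w):
    """log10 of n! with n = sigma**w, computed via the log-gamma function."""
    n = sigma ** w
    return math.lgamma(n + 1) / math.log(10)


def log10_stirling(sigma, w):
    """log10 of the approximation (n/e)**n quoted for the trivial bound."""
    n = sigma ** w
    return n * math.log10(n / math.e)


def log10_improved_bound(sigma, w):
    """log10 of n * (sigma * log_sigma(n))**n, assuming that reading of the bound.

    Since n = sigma**w, log_sigma(n) equals w exactly.
    """
    n = sigma ** w
    return math.log10(n) + n * math.log10(sigma * w)


sigma, w = 4, 5  # DNA alphabet and the smallest typical motif width
trivial = log10_trivial_bound(sigma, w)
improved = log10_improved_bound(sigma, w)
print(f"n = {sigma ** w}")
print(f"log10(n!)                 = {trivial:.1f}")
print(f"log10((n/e)^n)            = {log10_stirling(sigma, w):.1f}")
print(f"log10(n (s log_s n)^n)    = {improved:.1f}")
```

For σ = 4 and w = 5 (n = 1024), this gives roughly 10^2640 for the trivial bound against roughly 10^1335 for the bound under the reading assumed here. The gap widens quickly as w grows, because the base of the exponent is σw rather than σ^w/e.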

We relate this theoretical result to the exact motif discovery program TsukubaBB, whose algorithm contains the ideas that inspired the result. We describe a recent improvement to TsukubaBB that can give a speedup of a factor of nine or more, and we use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations.

Keywords

Input Sequence; Exact Algorithm; Motif Discovery; Score Matrix; Alphabet Size

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Paul Horton (1)
  • Wataru Fujibuchi (1)
  1. Computational Biology Research Center, National Institute of Advanced Industrial Science, Japan
