Abstract
We define the best consensus motif (BCM) problem, motivated by the problem of extracting motifs from nucleic acid and amino acid sequences. A type over an alphabet Σ is a family Ω of subsets of Σ*. A motif π of type Ω is a string π = π_1 ⋯ π_n of motif components, each of which stands for an element of Ω. The BCM problem for Ω is, given a yes-no sample S = {(α^(1), β^(1)), ..., (α^(m), β^(m))} of pairs of strings in Σ* with α^(i) ≠ β^(i) for 1 ≤ i ≤ m, to find a motif π of type Ω that maximizes the number of good pairs in S, where (α^(i), β^(i)) is good for π if π accepts α^(i) and rejects β^(i). We prove that the BCM problem is NP-complete even for a very simple type Ω_1 = 2^Σ − {∅}, which is used in practice for describing protein motifs in the PROSITE database. We also show that the problem remains NP-complete for the type Ω_∞ = Ω_1 ∪ {Σ^+} ∪ {Σ^[i,j] | 1 ≤ i ≤ j}, where Σ^[i,j] is the set of strings over Σ of length between i and j. Furthermore, for the BCM problem for Ω_1 we provide a polynomial-time greedy algorithm based on the probabilistic method. Its performance analysis gives an explicit approximation ratio for the algorithm.
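To make the objective concrete, the following is a minimal Python sketch of counting good pairs for a motif of type Ω_1. It rests on assumptions not stated in the abstract: a motif is taken to accept a string exactly when the string has the same length as the motif and each character belongs to the corresponding component (whole-string matching). The names `accepts`, `count_good_pairs`, `nonempty_subsets`, and `best_motif_bruteforce` are hypothetical, and the exhaustive search over a fixed motif length is only an illustration of the optimization target for tiny inputs; it is not the paper's greedy algorithm.

```python
from itertools import combinations, product
from typing import FrozenSet, List, Sequence, Tuple

# A motif of type Omega_1: each component is a non-empty subset of the alphabet.
Motif = Sequence[FrozenSet[str]]
Sample = List[Tuple[str, str]]  # pairs (alpha, beta) with alpha != beta


def accepts(motif: Motif, s: str) -> bool:
    """Assumed semantics: lengths match and every character lies in its component."""
    if len(s) != len(motif):
        return False
    return all(ch in comp for ch, comp in zip(s, motif))


def count_good_pairs(motif: Motif, sample: Sample) -> int:
    """A pair (alpha, beta) is good if the motif accepts alpha and rejects beta."""
    return sum(
        1 for alpha, beta in sample
        if accepts(motif, alpha) and not accepts(motif, beta)
    )


def nonempty_subsets(alphabet: str) -> List[FrozenSet[str]]:
    """All elements of 2^Sigma minus the empty set."""
    letters = sorted(set(alphabet))
    return [
        frozenset(c)
        for r in range(1, len(letters) + 1)
        for c in combinations(letters, r)
    ]


def best_motif_bruteforce(alphabet: str, length: int, sample: Sample) -> Motif:
    """Exhaustive search over motifs of one fixed length (exponential time)."""
    subsets = nonempty_subsets(alphabet)
    return max(
        product(subsets, repeat=length),
        key=lambda m: count_good_pairs(m, sample),
    )


if __name__ == "__main__":
    motif = [frozenset("AG"), frozenset("C"), frozenset("ST")]
    sample = [("ACS", "ACA"), ("GCT", "TCT"), ("ACT", "GCS")]
    # The third pair is not good because the motif also accepts "GCS".
    print(count_good_pairs(motif, sample))
```

The brute-force search enumerates (2^|Σ| − 1)^n candidate motifs of length n, which is consistent with the NP-completeness result in motivating an approximation algorithm for anything beyond toy inputs.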
© 1996 Springer-Verlag Berlin Heidelberg
Cite this paper
Tateishi, E., Maruyama, O., Miyano, S. (1996). Extracting best consensus motifs from positive and negative examples. In: Puech, C., Reischuk, R. (eds) STACS 96. STACS 1996. Lecture Notes in Computer Science, vol 1046. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60922-9_19
DOI: https://doi.org/10.1007/3-540-60922-9_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60922-3
Online ISBN: 978-3-540-49723-3
eBook Packages: Springer Book Archive