Measuring Over-Generalization in the Minimal Multiple Generalizations of Biosequences

Ng, Yen Kaow; Ono, Hirotaka; Shinohara, Takeshi

doi:10.1007/11563983_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3735)

Included in the following conference series: International Conference on Discovery Science

Abstract

We consider the problem of finding a set of patterns that best characterizes a set of strings. To this end, Arimura et al. [3] considered the use of minimal multiple generalizations (mmg) for such characterizations. Given any sample set, the mmgs are, roughly speaking, the most (syntactically) specific set of languages containing the sample within a given class of languages. Takae et al. [17] found the mmgs of the class of pattern languages [1] that includes so-called sort symbols to be fairly accurate as predictors for signal peptides. We first reproduce their results using updated data. Then, using a measure that estimates the level of over-generalization made by the mmgs, we show results that explain the high accuracy resulting from the use of sort symbols, and discuss how better results can be obtained. The measure that we suggest here can also be applied to other types of patterns, e.g. the PROSITE patterns [4].
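To make the idea of over-generalization concrete, the following is a minimal sketch and not the measure defined in the paper, which the abstract does not spell out. It assumes a simple proxy: the fraction of all strings of a fixed length that a union of patterns accepts, computed by brute-force enumeration over a toy alphabet. The token conventions (`"*"` for a variable, dictionary keys for sort symbols) and the function names `pattern_to_regex`, `covers`, and `coverage_ratio` are hypothetical and chosen only for illustration.

```python
import re
from itertools import product


def pattern_to_regex(pattern, sorts):
    """Translate a pattern (a list of tokens) into a compiled regex.

    Token conventions assumed for this sketch:
      "*"              a variable, matching any non-empty string
      a key of sorts   a sort symbol, matching one character of that class
      anything else    a literal constant string
    """
    parts = []
    for tok in pattern:
        if tok == "*":
            parts.append(".+")
        elif tok in sorts:
            parts.append("[" + re.escape(sorts[tok]) + "]")
        else:
            parts.append(re.escape(tok))
    return re.compile("".join(parts))


def covers(patterns, sample, sorts):
    """True if every sample string is matched by at least one pattern."""
    regexes = [pattern_to_regex(p, sorts) for p in patterns]
    return all(any(r.fullmatch(s) for r in regexes) for s in sample)


def coverage_ratio(patterns, alphabet, length, sorts):
    """Fraction of all strings of the given length accepted by the union.

    A larger ratio means the union of patterns is less specific. This is
    an assumed proxy for over-generalization, and the enumeration is only
    feasible for small alphabets and short lengths.
    """
    regexes = [pattern_to_regex(p, sorts) for p in patterns]
    total = len(alphabet) ** length
    hits = 0
    for chars in product(alphabet, repeat=length):
        s = "".join(chars)
        if any(r.fullmatch(s) for r in regexes):
            hits += 1
    return hits / total


if __name__ == "__main__":
    sorts = {"V": "ab"}          # toy sort symbol: V stands for 'a' or 'b'
    sample = ["aac", "abc"]
    with_variable = [["a", "*"]]
    with_sort = [["a", "V", "c"]]
    for name, pats in [("variable", with_variable), ("sort", with_sort)]:
        print(name, covers(pats, sample, sorts),
              coverage_ratio(pats, "abc", 3, sorts))
```

Both toy patterns cover the sample, but the one using a sort symbol accepts far fewer strings of length 3 than the one using a variable. Under the assumed proxy, that difference is what a lower level of over-generalization would look like.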

References

1. Angluin, D.: Finding patterns common to a set of strings. Journal of Computer and System Sciences 21, 46–62 (1980)
2. Arimura, H., Fujino, R., Shinohara, T., Arikawa, S.: Protein motif discovery from positive examples by Minimal Multiple Generalization over regular patterns. In: Proceedings of the Genome Informatics Workshop, pp. 39–48 (1994)
3. Arimura, H., Shinohara, T., Otsuki, S.: Finding minimal generalizations for unions of pattern languages and its application to inductive inference from positive data. In: Enjalbert, P., Mayr, E.W., Wagner, K.W. (eds.) STACS 1994. LNCS, vol. 775, pp. 649–660. Springer, Heidelberg (1994)
4. Bairoch, A.: PROSITE: A dictionary of sites and patterns in proteins. Nucl. Acids Res. 25(19), 2241–2245 (1991)
5. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: Genbank: update. Nucl. Acids Res. 32(Database-Issue), 23–26 (2004)
6. Brāzma, A., Jonassen, I., Eidhammer, I., Gilbert, D.: Approaches to the automatic discovery of patterns in biosequences. J. Comp. Biol. 5(2), 277–304 (1998)
7. Brejova, B., Vinar, T., Li, M.: Pattern Discovery: Methods and Software, Ch. 29, pp. 491–522. Humana Press (2003)
8. Case, J., Jain, S., Reischuk, R., Stephan, F., Zeugmann, T.: Learning a subclass of regular patterns in polynomial time. In: Gavaldá, R., Jantke, K.P., Takimoto, E. (eds.) ALT 2003. LNCS (LNAI), vol. 2842, pp. 234–246. Springer, Heidelberg (2003)
9. Chan, C., Garofalakis, M., Rastogi, R.: RE-tree: an efficient index structure for regular expressions. The VLDB Journal 12(2), 102–119 (2003)
10. Hopcroft, J., Ullman, J.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading (1979)
11. Kannan, S., Sweedyk, Z., Mahaney, S.: Counting and random generation of strings in regular languages. In: Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, pp. 551–557 (1995)
12. Ng, Y.K., Shinohara, T.: Inferring unions of the pattern languages by the most fitting covers. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 269–282. Springer, Heidelberg (2005)
13. Ono, H., Ng, Y.K.: Best fitting fixed-length substring patterns for a set of strings. In: Proceedings of The Eleventh International Computing and Combinatorics Conference (COCOON 2005) (2005) (to appear)
14. Shinohara, A.: String pattern discovery. In: Ben-David, S., Case, J., Maruoka, A. (eds.) ALT 2004. LNCS (LNAI), vol. 3244, pp. 1–13. Springer, Heidelberg (2004)
15. Shinohara, T.: Polynomial time inference of extended regular pattern languages. In: Goto, E., Nakajima, R., Yonezawa, A., Nakata, I., Furukawa, K. (eds.) RIMS 1982. LNCS, vol. 147, pp. 115–127. Springer, Heidelberg (1983)
16. Shinohara, T., Ng, Y.K.: Strong biases for the minimal multiple generalization algorithm on samples of very small sizes. In: The Proceedings of the 57th Meeting of SIG-FPAI, The Japanese Society of Artificial Intelligence (November 2004)
17. Takae, T., Kasai, T., Arimura, H., Shinohara, T.: Knowledge discovery in biosequences using sort regular patterns. In: Workshop on Applied Learning Theory (1998)
18. Uemura, J., Sato, M.: Compactness and learning of classes of unions of erasing regular pattern languages. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds.) ALT 2002. LNCS (LNAI), vol. 2533, pp. 293–307. Springer, Heidelberg (2002)
19. Yamaguchi, M., Shimozono, S., Shinohara, T.: Finding minimal multiple generalization over regular patterns with alphabet indexing. In: Proceedings of the Seventh Workshop on Genome Informatics, vol. 7, pp. 51–60. Universal Academy Press (1996)

Author information

Authors and Affiliations

Graduate School of Computer Science and Systems, Kyushu Institute of Technology, Iizuka, 820, Japan
Yen Kaow Ng
Department of Computer Science and Communication Engineering, Kyushu University, 6-10-1, Hakozaki, Fukuoka, 812-8581, Japan
Hirotaka Ono
Department of Artificial Intelligence, Kyushu Institute of Technology, Iizuka, 820, Japan
Takeshi Shinohara

Editor information

Editors and Affiliations

School of Computer Science & Engineering, The University of New South Wales, Sydney, Australia
Achim Hoffmann
Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, 567-0047, Ibaraki, Osaka, Japan
Hiroshi Motoda
Max Planck Institute for Computer Science, Saarbrücken, Germany
Tobias Scheffer

Cite this paper

Ng, Y.K., Ono, H., Shinohara, T. (2005). Measuring Over-Generalization in the Minimal Multiple Generalizations of Biosequences. In: Hoffmann, A., Motoda, H., Scheffer, T. (eds) Discovery Science. DS 2005. Lecture Notes in Computer Science, vol 3735. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563983_16

DOI: https://doi.org/10.1007/11563983_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29230-2
Online ISBN: 978-3-540-31698-5
eBook Packages: Computer Science, Computer Science (R0)