Skip to main content

Measuring Over-Generalization in the Minimal Multiple Generalizations of Biosequences

  • Conference paper
Discovery Science (DS 2005)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 3735))

Included in the following conference series:

Abstract

We consider the problem of finding a set of patterns that best characterizes a set of strings. To this end, Arimura et. al. [3] considered the use of minimal multiple generalizations (mmg) for such characterizations. Given any sample set, the mmgs are, roughly speaking, the most (syntactically) specific set of languages containing the sample within a given class of languages. Takae et. al. [17] found the mmgs of the class of pattern languages [1] which includes so-called sort symbols to be fairly accurate as predictors for signal peptides. We first reproduce their results using updated data. Then, by using a measure for estimating the level of over-generalizations made by the mmgs, we show results that explain the high level of accuracies resulting from the use of sort symbols, and discuss how better results can be obtained. The measure that we suggests here can also be applied to other types of patterns, e.g. the PROSITE patterns [4].

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Angluin, D.: Finding patterns common to a set of strings. Journal of Computer and System Sciences 21, 46–62 (1980)

    Article  MathSciNet  MATH  Google Scholar 

  2. Arimura, H., Fujino, R., Shinohara, T., Arikawa, S.: Protein motif discovery from positive examples by Minimal Multiple Generalization over regular patterns. In: Proceedings of the Genome Informatics Workshop, pp. 39–48 (1994)

    Google Scholar 

  3. Arimura, H., Shinohara, T., Otsuki, S.: Finding minimal generalizations for unions of pattern languages and its application to inductive inference from positive data. In: Enjalbert, P., Mayr, E.W., Wagner, K.W. (eds.) STACS 1994. LNCS, vol. 775, pp. 649–660. Springer, Heidelberg (1994)

    Google Scholar 

  4. Bairoch, A.: PROSITE: A dictionary of sites and patterns in proteins. Nucl. Acids Res. 25(19), 2241–2245 (1991)

    Google Scholar 

  5. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: Genbank: update. Nucl. Acids Res. 32(Database-Issue), 23–26 (2004)

    Article  Google Scholar 

  6. Brāzma, A., Jonassen, I., Eidhammer, I., Gilbert, D.: Approaches to the automatic discovery of patterns in biosequences. J. Comp. Biol. 5(2), 277–304 (1998)

    Google Scholar 

  7. Brejova, B., Vinar, T., Li, M.: Pattern Discovery: Methods and Software, Ch. 29, pp. 491–522. Humana Press (2003)

    Google Scholar 

  8. Case, J., Jain, S., Reischuk, R., Stephan, F., Zeugmann, T.: Learning a subclass of regular patterns in polynomial time. In: Gavaldá, R., Jantke, K.P., Takimoto, E. (eds.) ALT 2003. LNCS (LNAI), vol. 2842, pp. 234–246. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  9. Chan, C., Garofalakis, M., Rastogi, R.: RE-tree: an efficient index structure for regular expressions. The VLDB Journal 12(2), 102–119 (2003)

    Article  Google Scholar 

  10. Hopcroft, J., Ullman, J.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading (1979)

    MATH  Google Scholar 

  11. Kannan, S., Sweedyk, Z., Mahaney, S.: Counting and random generation of strings in regular languages. In: Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, pp. 551–557 (1995)

    Google Scholar 

  12. Ng, Y.K., Shinohara, T.: Inferring unions of the pattern languages by the most fitting covers. In: Jain, S., Simon, H.U., Tomita, E. (eds.) ALT 2005. LNCS (LNAI), vol. 3734, pp. 269–282. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  13. Ono, H., Ng, Y.K.: Best fitting fixed-length substring patterns for a set of strings. In: Proceedings of The Eleventh International Computing and Combinatorics Conference (COCOON 2005) (2005) (to appear)

    Google Scholar 

  14. Shinohara, A.: String pattern discovery. In: Ben-David, S., Case, J., Maruoka, A. (eds.) ALT 2004. LNCS (LNAI), vol. 3244, pp. 1–13. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  15. Shinohara, T.: Polynomial time inference of extended regular pattern languages. In: Goto, E., Nakajima, R., Yonezawa, A., Nakata, I., Furukawa, K. (eds.) RIMS 1982. LNCS, vol. 147, pp. 115–127. Springer, Heidelberg (1983)

    Google Scholar 

  16. Shinohara, T., Ng, Y.K.: Strong biases for the minimal multiple generalization algorithm on samples of very small sizes. In: The Proceedings of the 57th Meeting of SIG-FPAI, The Japanese Society of Artificial Intelligence (November 2004)

    Google Scholar 

  17. Takae, T., Kasai, T., Arimura, H., Shinohara, T.: Knowledge discovery in biosequences using sort regular patterns. In: Workshop on Applied Learning Theory (1998)

    Google Scholar 

  18. Uemura, J., Sato, M.: Compactness and learning of classes of unions of erasing regular pattern languages. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds.) ALT 2002. LNCS (LNAI), vol. 2533, pp. 293–307. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  19. Yamaguchi, M., Shimozono, S., Shinohara, T.: Finding minimal multiple generalization over regular patterns with alphabet indexing. In: Proceedings of the Seventh Workshop on Genome Informatics, vol. 7, pp. 51–60. Universal Academy Press (1996)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ng, Y.K., Ono, H., Shinohara, T. (2005). Measuring Over-Generalization in the Minimal Multiple Generalizations of Biosequences. In: Hoffmann, A., Motoda, H., Scheffer, T. (eds) Discovery Science. DS 2005. Lecture Notes in Computer Science(), vol 3735. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563983_16

Download citation

  • DOI: https://doi.org/10.1007/11563983_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29230-2

  • Online ISBN: 978-3-540-31698-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics