Optimal Discovery of Subword Associations in Strings

  • Conference paper
Discovery Science (DS 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3245)

Abstract

Given a textstring x of n symbols and an integer constant d, we consider the problem of finding, for any pair (y,z) of subwords of x, the number of times that y and z occur in tandem (i.e., with no intermediate occurrence of either one of them) within a distance of d symbols of x. Although in principle there might be $n^4$ distinct subword pairs in x, we show that it suffices to consider a family of only $n^2$ such pairs, with the property that for any neglected pair (y′,z′), there is a corresponding pair (y,z) contained in our family and such that: (i) y′ is a prefix of y and z′ is a prefix of z, and (ii) the tandem index of (y′,z′) equals that of (y,z). We show that an algorithm for the construction of the table of all such tandem indices can be built to run in optimal $O(n^2)$ time and space.
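
To make the notion of a tandem index concrete, the following is a minimal brute-force sketch for a single pair (y,z). It is not the optimal $O(n^2)$ table construction announced in the abstract, and it rests on assumptions that the abstract does not settle: distance is measured between starting positions, z must start strictly after y, and "intermediate" means an occurrence starting strictly between the two. The function names `occurrences` and `tandem_index` are hypothetical.

```python
def occurrences(x: str, w: str) -> list[int]:
    """Start positions of all (possibly overlapping) occurrences of w in x."""
    if not w:
        raise ValueError("subword must be non-empty")
    return [i for i in range(len(x) - len(w) + 1) if x.startswith(w, i)]


def tandem_index(x: str, y: str, z: str, d: int) -> int:
    """Count occurrences of y followed by an occurrence of z within d
    positions, with no occurrence of y or z starting in between.

    Assumed reading of the abstract's definition, not the paper's
    formal one; runs in time quadratic in the number of occurrences.
    """
    y_pos = occurrences(x, y)
    z_pos = occurrences(x, z)
    count = 0
    for i in y_pos:
        # Occurrences of z starting after i and at most d positions away.
        later_z = [j for j in z_pos if i < j <= i + d]
        if not later_z:
            continue
        # Taking the nearest one rules out an intermediate z by construction.
        j = min(later_z)
        # Reject the pair if another y starts strictly between i and j.
        if any(i < k < j for k in y_pos):
            continue
        count += 1
    return count


if __name__ == "__main__":
    x = "abcabxabc"
    print(tandem_index(x, "ab", "c", 3))
```

In the string `abcabxabc`, the subword `ab` starts at positions 0, 3 and 6 and `c` at positions 2 and 8. Under the assumptions above, only the occurrences of `ab` at 0 and 6 are followed by a `c` within 3 positions, and raising d to 5 does not add the occurrence at 3 because another `ab` intervenes before the next `c`.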

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Apostolico, A., Pizzi, C., Satta, G. (2004). Optimal Discovery of Subword Associations in Strings. In: Suzuki, E., Arikawa, S. (eds) Discovery Science. DS 2004. Lecture Notes in Computer Science, vol 3245. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30214-8_21

  • DOI: https://doi.org/10.1007/978-3-540-30214-8_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23357-2

  • Online ISBN: 978-3-540-30214-8

  • eBook Packages: Springer Book Archive
