Abstract
Given a textstring x of n symbols and an integer constant d, we consider the problem of finding, for any pair (y,z) of subwords of x, the number of times that y and z occur in tandem (i.e., with no intermediate occurrence of either one of them) within a distance of d symbols of x. Although in principle there might be n^4 distinct subword pairs in x, we show that it suffices to consider a family of only n^2 such pairs, with the property that for any neglected pair (y′,z′), there is a corresponding pair (y,z) contained in our family such that: (i) y′ is a prefix of y and z′ is a prefix of z, and (ii) the tandem index of (y′,z′) equals that of (y,z). We show that an algorithm for the construction of the table of all such tandem indices can be built to run in optimal O(n^2) time and space.
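To make the notion of a tandem index concrete, the following is a minimal brute-force sketch in Python. It is not the O(n^2) table construction claimed in the abstract. It only illustrates one plausible reading of the definition for a single pair (y,z). The function names `occurrences` and `tandem_index` are hypothetical, and the sketch assumes that distance is measured between the starting positions of y and z, that z must start strictly after y, and that "no intermediate occurrence" means no occurrence of y or z starts strictly between the two.

```python
from __future__ import annotations


def occurrences(x: str, w: str) -> list[int]:
    """Start positions of all (possibly overlapping) occurrences of w in x."""
    if not w:
        raise ValueError("subword must be non-empty")
    return [i for i in range(len(x) - len(w) + 1) if x.startswith(w, i)]


def tandem_index(x: str, y: str, z: str, d: int) -> int:
    """Brute-force tandem count for the pair (y, z) in x within distance d.

    Assumed reading (not taken from the paper's formal definition):
    an occurrence of y at i and an occurrence of z at j form a tandem
    when i < j <= i + d and no other occurrence of y or z starts
    strictly between i and j.
    """
    y_starts = occurrences(x, y)
    y_set = set(y_starts)
    z_set = set(occurrences(x, z))
    count = 0
    for i in y_starts:
        # Scan the window of d positions that follows this occurrence of y.
        for j in range(i + 1, min(i + d, len(x) - 1) + 1):
            if j in z_set:
                # First z after i, with no y in between: a tandem.
                count += 1
                break
            if j in y_set:
                # Another y intervenes before any z: not a tandem.
                break
    return count


if __name__ == "__main__":
    print(tandem_index("abcabdabc", "ab", "c", 3))
```

Under these assumptions, each query scans at most d positions per occurrence of y, so answering it separately for every candidate pair would be far more expensive than the single table construction the abstract describes.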
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Apostolico, A., Pizzi, C., Satta, G. (2004). Optimal Discovery of Subword Associations in Strings. In: Suzuki, E., Arikawa, S. (eds) Discovery Science. DS 2004. Lecture Notes in Computer Science, vol 3245. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30214-8_21
DOI: https://doi.org/10.1007/978-3-540-30214-8_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23357-2
Online ISBN: 978-3-540-30214-8
eBook Packages: Springer Book Archive