Optimal Discovery of Subword Associations in Strings

Apostolico, Alberto; Pizzi, Cinzia; Satta, Giorgio

doi:10.1007/978-3-540-30214-8_21

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3245)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Given a textstring x of n symbols and an integer constant d, we consider the problem of finding, for any pair (y,z) of subwords of x, the number of times that y and z occur in tandem (i.e., with no intermediate occurrence of either one of them) within a distance of d symbols of x. Although in principle there might be n⁴ distinct subword pairs in x, we show that it suffices to consider a family of only n² such pairs, with the property that for any neglected pair (y′,z′) there is a corresponding pair (y,z) contained in our family such that: (i) y′ is a prefix of y and z′ is a prefix of z, and (ii) the tandem index of (y′,z′) equals that of (y,z). We show that an algorithm for the construction of the table of all such tandem indices can be built to run in optimal O(n²) time and space.
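
As an illustration of the quantity described in the abstract, the following is a minimal brute-force sketch of a tandem index for a single pair (y,z). It is not the authors' O(n²) algorithm, and it rests on assumptions the abstract leaves open: distance is measured between start positions, "intermediate occurrence" means an occurrence of y or z starting strictly between the two, and y and z are non-empty. The names `occurrences` and `tandem_index` are hypothetical.

```python
def occurrences(x, w):
    """Start positions of all (possibly overlapping) occurrences of w in x."""
    if not w:
        raise ValueError("subword must be non-empty")
    return [i for i in range(len(x) - len(w) + 1) if x.startswith(w, i)]


def tandem_index(x, y, z, d):
    """Count pairs (i, j) where y starts at i, z starts at j, i < j <= i + d,
    and no occurrence of y or z starts strictly between i and j.

    Assumption: distance is measured between start positions.
    """
    ys = occurrences(x, y)
    zs = occurrences(x, z)
    starts = set(ys) | set(zs)  # every start of y or z in x
    count = 0
    for i in ys:
        for j in zs:
            within_distance = i < j <= i + d
            if within_distance and not any(i < k < j for k in starts):
                count += 1
    return count
```

For example, under these assumptions `tandem_index("abcabcab", "ab", "c", 3)` counts the two places where "ab" is followed by "c" with nothing in between. Running this for every pair of subwords would be far slower than the bound stated in the abstract; the sketch only pins down what is being counted.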

Author information

Authors and Affiliations

Dipartimento di Ingegneria dell'Informazione, Università di Padova, Padova, Italy
Alberto Apostolico
Department of Computer Sciences, Purdue University, Computer Sciences Building, West Lafayette, IN, 47907, USA
Alberto Apostolico
Dipartimento di Ingegneria dell'Informazione, Università di Padova, Via Gradenigo 6/A, 35131, Padova, Italy
Cinzia Pizzi & Giorgio Satta

Editor information

Editors and Affiliations

Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University, 744 Motooka, Nishi, 819-0395, Fukuoka, Japan
Einoshin Suzuki
Kyushu University, 6–10–1 Hakozaki Higashi-ku, 812–8581, Fukuoka, Japan
Setsuo Arikawa

About this paper

Cite this paper

Apostolico, A., Pizzi, C., Satta, G. (2004). Optimal Discovery of Subword Associations in Strings. In: Suzuki, E., Arikawa, S. (eds) Discovery Science. DS 2004. Lecture Notes in Computer Science, vol 3245. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30214-8_21

DOI: https://doi.org/10.1007/978-3-540-30214-8_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23357-2
Online ISBN: 978-3-540-30214-8
eBook Packages: Springer Book Archive
