Computing q-Gram Non-overlapping Frequencies on SLP Compressed Texts

Goto, Keisuke; Bannai, Hideo; Inenaga, Shunsuke; Takeda, Masayuki

doi:10.1007/978-3-642-27660-6_25

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7147)

Included in the following conference series:

International Conference on Current Trends in Theory and Practice of Computer Science

Abstract

Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all q-grams contained in the data is an important problem with many applications in the fields of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^2 n)$ time and $O(qn)$ space, where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009), which solved the problem only for $q = 2$ in $O(n^4 \log n)$ time and $O(n^3)$ space.
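
To make the problem statement concrete, the following is a minimal sketch of a naive baseline that first decompresses the SLP and then counts non-overlapping q-gram occurrences on the plain text. It is not the algorithm of the paper, which works on the compressed form directly; it only illustrates what is being computed. The rule encoding (a dictionary mapping each variable to either a single character or a pair of variables) and the function names `expand_slp` and `nonoverlapping_qgram_frequencies` are assumptions made for this illustration.

```python
from collections import defaultdict


def expand_slp(rules, start):
    """Decompress an SLP into the string it derives.

    `rules` maps each variable name to either a single character
    (terminal rule) or a pair of variable names (concatenation rule).
    This encoding is an assumption made for illustration.
    """
    memo = {}

    def derive(var):
        if var not in memo:
            rhs = rules[var]
            if isinstance(rhs, str):
                memo[var] = rhs
            else:
                left, right = rhs
                memo[var] = derive(left) + derive(right)
        return memo[var]

    return derive(start)


def nonoverlapping_qgram_frequencies(text, q):
    """Count, for every q-gram, the maximum number of pairwise
    non-overlapping occurrences in `text`.

    Occurrences of one q-gram all have the same length, so greedily
    taking the leftmost occurrence that does not overlap the previously
    taken one yields the maximum.
    """
    freq = defaultdict(int)
    last_end = {}  # q-gram -> end position of its last counted occurrence
    for i in range(len(text) - q + 1):
        gram = text[i:i + q]
        if last_end.get(gram, 0) <= i:
            freq[gram] += 1
            last_end[gram] = i + q
    return dict(freq)


# A small hypothetical SLP deriving the text "aaaab".
rules = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X1"),  # aa
    "X4": ("X3", "X2"),  # aab
    "X5": ("X3", "X4"),  # aaaab
}
text = expand_slp(rules, "X5")
print(nonoverlapping_qgram_frequencies(text, 2))
```

In this example the 2-gram "aa" occurs three times in "aaaab" (at positions 0, 1, and 2), but only two of those occurrences can be chosen without overlap, so its non-overlapping frequency is 2. The baseline takes time linear in the decompressed text length for fixed q, and that length can be exponential in the SLP size n, which is why bounds stated in terms of n, as in the abstract, are of interest.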

References

Amir, A., Benson, G.: Efficient two-dimensional compressed matching. In: Proc. DCC 1992, pp. 279–288 (1992)
Apostolico, A., Preparata, F.P.: Data structures and algorithms for the string statistics problem. Algorithmica 15(5), 481–494 (1996)
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings. In: Proc. SODA 2011, pp. 373–389 (2011)
Brodal, G.S., Lyngsø, R.B., Östlin, A., Pedersen, C.N.S.: Solving the String Statistics Problem in Time O(n log n). In: Widmayer, P., Triguero, F., Morales, R., Hennessy, M., Eidenbenz, S., Conejo, R. (eds.) ICALP 2002. LNCS, vol. 2380, pp. 728–739. Springer, Heidelberg (2002)
Goto, K., Bannai, H., Inenaga, S., Takeda, M.: Towards efficient mining and classification on compressed strings. In: Accepted for SPIRE 2011 (2011), preprint available at arXiv:1103.3114v2
Hermelin, D., Landau, G.M., Landau, S., Weimann, O.: A unified algorithm for accelerating edit-distance computation via text-compression. In: Proc. STACS 2009, pp. 529–540 (2009)
Inenaga, S., Bannai, H.: Finding characteristic substring from compressed texts. In: Proc. The Prague Stringology Conference 2009, pp. 40–54 (2009); full version to appear in the International Journal of Foundations of Computer Science
Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing 4, 172–186 (1997)
Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6(2), 323–350 (1977)
Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proceedings of the IEEE 88(11), 1722–1732 (2000)
Lifshits, Y.: Processing Compressed Texts: A Tractability Border. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 228–240. Springer, Heidelberg (2007)
Matsubara, W., Inenaga, S., Ishino, A., Shinohara, A., Nakamura, T., Hashimoto, K.: Efficient algorithms to compute compressed longest common substrings and compressed palindromes. Theoretical Computer Science 410(8-10), 900–913 (2009)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), 2 (2007)
Nevill-Manning, C.G., Witten, I.H., Maulsby, D.L.: Compression by induction of hierarchical grammars. In: Proc. DCC 1994, pp. 244–253 (1994)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23(3), 337–349 (1977)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, 744 Motooka, Nishiku, Fukuoka, 819-0395, Japan
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga & Masayuki Takeda

Editor information

Editors and Affiliations

Faculty of Informatics and Information Technologies, Institute of Informatics and Software Engineering, Slovak University of Technology in Bratislava, Ilkovičova 3, 842 16, Bratislava 4, Slovakia
Mária Bieliková
Dept. of Intelligent Systems and Business Informatics, Alpen-Adria-Universität Klagenfurt, Universitätsstr. 65-57, 9020, Klagenfurt, Austria
Gerhard Friedrich
Department of Computer Science, University of Oxford, UK
Georg Gottlob
Security Engineering Group, Technische Universität Darmstadt, Hochschulstr. 10, 64289, Darmstadt, Germany
Stefan Katzenbeisser
University of Illinois at Chicago, Dept. of Math., Stat. and Comp. Sci, 851 S. Morgan Street, Chicago, IL 60607-7045, USA; and University of Szeged, Research Group on Artificial Intelligence of the Hungarian Academy of Sciences, 6701 Szeged, Postafiók, Hungary
György Turán

About this paper

Cite this paper

Goto, K., Bannai, H., Inenaga, S., Takeda, M. (2012). Computing q-Gram Non-overlapping Frequencies on SLP Compressed Texts. In: Bieliková, M., Friedrich, G., Gottlob, G., Katzenbeisser, S., Turán, G. (eds) SOFSEM 2012: Theory and Practice of Computer Science. SOFSEM 2012. Lecture Notes in Computer Science, vol 7147. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27660-6_25

DOI: https://doi.org/10.1007/978-3-642-27660-6_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27659-0
Online ISBN: 978-3-642-27660-6
eBook Packages: Computer Science, Computer Science (R0)
