Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation

Dubov, Mikhail

doi:10.1007/978-3-319-26123-2_30

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 542))

Included in the following conference series:

International Conference on Analysis of Images, Social Networks and Texts

Abstract

We present an improved implementation of the annotated suffix tree method for text analysis (the AST method). Annotated suffix trees extend the original suffix tree data structure: each node is labeled with the occurrence frequency of the corresponding substring in the input text collection. They have a range of applications in text analysis, such as language-independent computation of a matching score for a keyphrase against a text collection. Our enhanced implementation uses new algorithms and data structures, with suffix arrays replacing the traditional but heavyweight suffix trees, and outperforms the previous implementations in both memory consumption (10 times less memory) and runtime. We describe an open-source statistical text analysis software package, called “EAST”, which implements this enhanced annotated suffix tree method. In addition, the EAST package includes an adaptation of a distributional synonym extraction algorithm that supports the Russian language and allows us to achieve better results in keyphrase matching.
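To make the idea of frequency-annotated nodes concrete, here is a minimal Python sketch. It is not the EAST implementation: it builds a naive, uncompressed suffix trie rather than the suffix-array-based structure the abstract describes, and all names (`Node`, `build_annotated_trie`, `frequency`, `match_score`) are hypothetical. The scoring rule, which averages conditional probabilities along each matched path and then averages over the suffixes of the keyphrase, is one formulation assumed here for illustration; the abstract does not specify the formula.

```python
class Node:
    """Trie node annotated with the number of suffixes passing through it."""

    def __init__(self):
        self.children = {}
        self.freq = 0


def build_annotated_trie(strings):
    """Insert every suffix of every string, incrementing counts along its path.

    Naive O(n^2) construction per string; fine for a small illustration only.
    """
    root = Node()
    for s in strings:
        for start in range(len(s)):
            root.freq += 1  # the root counts all inserted suffixes
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    return root


def frequency(root, substring):
    """Return how many times `substring` occurs in the indexed strings."""
    node = root
    for ch in substring:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.freq


def match_score(root, keyphrase):
    """Assumed scoring rule: for each suffix of the keyphrase, follow the
    longest matching path, average child.freq / parent.freq along it, and
    average these values over all suffixes of the keyphrase."""
    if not keyphrase:
        return 0.0
    total = 0.0
    for start in range(len(keyphrase)):
        node = root
        probs = []
        for ch in keyphrase[start:]:
            child = node.children.get(ch)
            if child is None:
                break
            probs.append(child.freq / node.freq)
            node = child
        if probs:
            total += sum(probs) / len(probs)
    return total / len(keyphrase)


trie = build_annotated_trie(["abab"])
ab_count = frequency(trie, "ab")    # occurrences of "ab" in "abab"
ab_score = match_score(trie, "ab")  # score in the range [0, 1]
```

Because the score depends only on character-level frequencies, nothing in the sketch is tied to a particular language, which is the property the abstract highlights for keyphrase matching.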

Acknowledgments

This research, carried out in 2015, was supported by “The National Research University ‘Higher School of Economics’ Academic Fund Program” grant (15-05-0041). The financial support from the Government of the Russian Federation within the framework of the implementation of the 5–100 Programme Roadmap of the National Research University – Higher School of Economics is acknowledged.

Author information

Authors and Affiliations

Computer Science Faculty, National Research University Higher School of Economics, Moscow, Russia
Mikhail Dubov

Corresponding author

Correspondence to Mikhail Dubov.

Editor information

Editors and Affiliations

Krasovsky Institute of Mathematics and Mechanics, Yekaterinburg, Russia
Mikhail Yu. Khachay
Wolverhampton, United Kingdom
Natalia Konstantinova
Technische Universität Darmstadt, Darmstadt, Germany
Alexander Panchenko
National Research University Higher School of Economics, Moscow, Russia
Dmitry Ignatov
Ural Federal University, Yekaterinburg, Russia
Valeri G. Labunets

About this paper

Cite this paper

Dubov, M. (2015). Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation. In: Khachay, M., Konstantinova, N., Panchenko, A., Ignatov, D., Labunets, V. (eds) Analysis of Images, Social Networks and Texts. AIST 2015. Communications in Computer and Information Science, vol 542. Springer, Cham. https://doi.org/10.1007/978-3-319-26123-2_30

DOI: https://doi.org/10.1007/978-3-319-26123-2_30
Published: 05 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26122-5
Online ISBN: 978-3-319-26123-2
eBook Packages: Computer Science, Computer Science (R0)
