AST Method for Scoring String-to-text Similarity

  • Ekaterina Chernyak
  • Boris Mirkin
Part of the Springer Optimization and Its Applications book series (SOIA, volume 92)


A suffix-tree-based method for measuring the similarity of a key phrase to an unstructured text is proposed. The measure is computationally inexpensive and does not depend on the length of the text or of the key phrase. The method is applied to:
  1. finding interrelations between key phrases over a set of texts;
  2. annotating a research article by topics from a taxonomy of the domain;
  3. clustering relevant topics and mapping clusters on a domain taxonomy.
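The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of how a suffix-tree-style similarity score of this kind could be computed. It assumes a character-level suffix trie whose nodes are annotated with frequencies, and it assumes a score defined as the average conditional probability along each matched path, averaged over all suffixes of the key phrase. The names `Node`, `build_annotated_suffix_trie`, `match_score`, and `similarity`, and the scoring formula itself, are illustrative assumptions rather than the authors' definitions.

```python
class Node:
    """Trie node annotated with the number of suffixes passing through it."""

    def __init__(self):
        self.children = {}
        self.freq = 0


def build_annotated_suffix_trie(fragments):
    """Insert every suffix of every text fragment, counting visits per node."""
    root = Node()
    for fragment in fragments:
        for start in range(len(fragment)):
            node = root
            for ch in fragment[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    # The root's frequency is the total number of inserted suffixes.
    root.freq = sum(child.freq for child in root.children.values())
    return root


def match_score(root, s):
    """Average conditional probability along the longest prefix of s found in the trie."""
    node, total, depth = root, 0.0, 0
    for ch in s:
        child = node.children.get(ch)
        if child is None:
            break
        total += child.freq / node.freq  # assumed conditional probability of this step
        node, depth = child, depth + 1
    return total / depth if depth else 0.0


def similarity(root, phrase):
    """Average match_score over all suffixes of the key phrase (assumed formula)."""
    if not phrase:
        return 0.0
    scores = [match_score(root, phrase[i:]) for i in range(len(phrase))]
    return sum(scores) / len(scores)


# Hypothetical usage: the text is split into short fragments (here, words).
fragments = ["suffix", "tree", "method", "string", "similarity"]
trie = build_annotated_suffix_trie(fragments)
print(similarity(trie, "suffix tree"))
print(similarity(trie, "quantum"))
```

Because every step contributes a ratio between 0 and 1 and both averages are normalised by the number of terms, a score built this way stays in [0, 1] whatever the size of the text or the phrase. A production implementation would more likely use a compact suffix tree than the plain trie sketched here.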



Suffix tree · Unstructured text analysis · String similarity measures



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. School of Applied Mathematics and Information Science, National Research University – Higher School of Economics, Moscow, Russia
