Language Independent Extraction of Key Terms: An Extensive Comparison of Metrics

  • Luís F. S. Teixeira
  • Gabriel P. Lopes
  • Rita A. Ribeiro
Part of the Communications in Computer and Information Science book series (CCIS, volume 358)


In this paper, twenty language-independent, statistically based metrics for extracting key terms from any document collection are compared. Some of these metrics are widely used for this purpose; the others were created recently. Two document representations are considered in our experiments. One is based on words and multi-words, and the other is based on word prefixes of fixed length (5 characters in the experiments reported here). Prefixes were used to study how morphologically rich languages, namely Portuguese and Czech, behave under this second kind of representation. English is also studied, as an example of a language that is not morphologically rich. Results are manually evaluated, and agreement between evaluators is assessed using k-Statistics. The metrics based on Tf-Idf and Phi-square proved to have higher precision and recall. The prefix-based representation of documents enabled a significant precision improvement for documents written in Portuguese. For Czech, recall also improved.
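The prefix-based representation and a Tf-Idf score can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which Tf-Idf variant was used, so the textbook form tf × log(N / df) is assumed here, and the function names `to_prefixes` and `tf_idf`, the whitespace tokenisation, and the toy Portuguese corpus are all hypothetical. Only the prefix length of 5 characters comes from the abstract.

```python
import math
from collections import Counter

PREFIX_LEN = 5  # fixed prefix length reported in the abstract


def to_prefixes(text, n=PREFIX_LEN):
    """Lowercase, split on whitespace, and keep the first n characters of each token.

    Whitespace tokenisation is an assumption made for this sketch.
    """
    return [tok[:n] for tok in text.lower().split()]


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumed formulation: relative term frequency times log(N / df),
    where N is the number of documents and df the document frequency.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


# Toy corpus (hypothetical): inflected forms such as "governos" and "governo"
# collapse to the same prefix "gover".
corpus = [
    "os governos europeus aprovaram o plano",
    "o governo aprovou a proposta",
    "a proposta chegou ontem",
]
prefix_docs = [to_prefixes(d) for d in corpus]
scores = tf_idf(prefix_docs)

# Print the terms of the first document, highest score first.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.4f}")
```

In this toy corpus, "governos" and "governo" are counted as one term, so the prefix "gover" has a document frequency of 2, whereas the two full word forms would each have a document frequency of 1. This kind of merging of inflected forms is what the prefix representation is meant to probe in morphologically rich languages.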


Document keywords; Document topics; Words; Multi-words; Prefixes; Automatic extraction; Suffix arrays





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Luís F. S. Teixeira (2)
  • Gabriel P. Lopes (2)
  • Rita A. Ribeiro (1)
  1. CA3-Uninova, Campus FCT/UNL, Caparica, Portugal
  2. CITI, Dep. Informática, FCT/UNL, Caparica, Portugal
