Language Independent Extraction of Key Terms: An Extensive Comparison of Metrics
This paper compares twenty language-independent, statistically based metrics for extracting key terms from any document collection. Some of these metrics are widely used for this purpose; the others were created recently. Two document representations are considered in our experiments: one based on words and multi-words, and the other based on word prefixes of fixed length (5 characters in the experiments reported here). Prefixes were used to study how morphologically rich languages, namely Portuguese and Czech, behave under this second representation. English is also studied, as an example of a language that is not morphologically rich. Results are evaluated manually, and agreement between evaluators is assessed using k-Statistics. The metrics based on Tf-Idf and Phi-square proved to have higher precision and recall than the other metrics compared. The prefix-based representation of documents enabled a significant precision improvement for documents written in Portuguese. For Czech, recall also improved.
Keywords: Document keywords, Document topics, Words, Multi-words, Prefixes, Automatic extraction, Suffix arrays
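To make the prefix-based representation concrete, the sketch below truncates each word to its first 5 characters and ranks the resulting prefixes with a textbook Tf-Idf score. It is an illustration only, under stated assumptions: the function names (`to_prefixes`, `tf_idf`, `top_terms`), the regular-expression tokenization, and the exact Tf-Idf formula (relative term frequency times log(N/df)) are not taken from the paper, whose metric variants may differ.

```python
import math
import re
from collections import Counter

PREFIX_LEN = 5  # fixed prefix length mentioned in the abstract


def to_prefixes(text, n=PREFIX_LEN):
    """Lowercase the text, split it into word tokens, and keep the first
    n characters of each token (assumed tokenization, not the paper's)."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok[:n] for tok in tokens]


def tf_idf(docs):
    """Score every term of every document.

    docs is a list of term lists. Returns one {term: score} dict per
    document, using the textbook form tf * log(N / df), which is an
    assumption rather than the paper's exact definition.
    """
    n_docs = len(docs)
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each term once per document

    scores = []
    for terms in docs:
        counts = Counter(terms)
        total = len(terms)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def top_terms(score, k=3):
    """Return the k best-scoring terms, breaking ties alphabetically."""
    return sorted(score, key=lambda t: (-score[t], t))[:k]


if __name__ == "__main__":
    corpus = [
        "Keyword extraction extracts keywords from documents",
        "Suffix arrays index the documents",
    ]
    prefixed = [to_prefixes(d) for d in corpus]
    for doc_scores in tf_idf(prefixed):
        print(top_terms(doc_scores))
```

Truncation merges inflected forms such as "keyword" and "keywords" into a single term, which is the effect the prefix representation is meant to have for morphologically rich languages. Terms that occur in every document receive a score of zero under this formula.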