Abstract
A keyword or topic of a document is a word or multi-word (a sequence of two or more words) that by itself summarizes part of the document's content. In this paper we compare several statistics-based, language-independent methodologies for automatic keyword extraction. We rank words, multi-words, and word prefixes (of a fixed length of 5 characters) using several similarity measures (some widely known, some newly coined), and we evaluate the results obtained as well as the agreement between evaluators. Experiments were carried out on Portuguese, English and Czech.
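As an illustration of what ranking fixed-length word prefixes could look like, the following minimal sketch truncates each word to its first 5 characters and orders the prefixes of one document by a score. The abstract does not name the measures compared in the paper, so TF-IDF is used here purely as a stand-in assumption; the function names (`tokenize`, `prefixes`, `rank_terms`) and the toy corpus are likewise hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

PREFIX_LEN = 5  # fixed prefix length stated in the abstract


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def prefixes(tokens, length=PREFIX_LEN):
    """Map each token to its first `length` characters; shorter tokens are kept whole."""
    return [tok[:length] for tok in tokens]


def rank_terms(docs, doc_index, top_k=5):
    """Rank the terms of docs[doc_index] by TF-IDF, highest score first.

    TF-IDF is an assumption made for this sketch; it stands in for the
    (unnamed) measures that the paper compares.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))
    counts = Counter(docs[doc_index])
    total = sum(counts.values())
    scores = {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical toy corpus, one string per document.
corpus = [
    "Keyword extraction ranks keywords by statistical relevance.",
    "Suffix arrays index every suffix of a text.",
    "Statistical measures compare documents in a corpus.",
]
prefix_docs = [prefixes(tokenize(text)) for text in corpus]
top_prefixes = rank_terms(prefix_docs, doc_index=0)
print(top_prefixes)
```

Truncating to a prefix merges inflected forms such as "keyword" and "keywords" without any language-specific stemmer, which fits the language-independent setting described above. The same `rank_terms` function could be applied to whole words or to multi-words by changing only how each document is turned into a list of terms.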
Copyright information
© 2011 IFIP International Federation for Information Processing
About this paper
Cite this paper
Teixeira, L., Lopes, G., Ribeiro, R.A. (2011). Automatic Extraction of Document Topics. In: Camarinha-Matos, L.M. (eds) Technological Innovation for Sustainability. DoCEIS 2011. IFIP Advances in Information and Communication Technology, vol 349. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19170-1_11
DOI: https://doi.org/10.1007/978-3-642-19170-1_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19169-5
Online ISBN: 978-3-642-19170-1
eBook Packages: Computer Science, Computer Science (R0)