Automatic Extraction of Document Topics

Teixeira, Luís; Lopes, Gabriel; Ribeiro, Rita A.

doi:10.1007/978-3-642-19170-1_11

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 349)

Included in the following conference series:

Doctoral Conference on Computing, Electrical and Industrial Systems

Abstract

A keyword or topic of a document is a word or multi-word (a sequence of two or more words) that in itself summarizes part of the document's content. In this paper we compare several statistics-based, language-independent methodologies for automatically extracting keywords. We rank words, multi-words, and word prefixes (of a fixed length of 5 characters) using several similarity measures (some widely known, some newly coined), and we evaluate both the results obtained and the agreement between evaluators. The experiments covered Portuguese, English and Czech.
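
The ranking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the measures the authors compare, so Tf-Idf is used here purely as an assumed stand-in for a "widely known" one. Treating every contiguous two-word sequence as a multi-word is likewise an illustrative simplification, not the authors' extraction method. The names `tokenize`, `candidates`, and `rank_terms` are hypothetical; only the prefix length of 5 characters comes from the text.

```python
import math
import re
from collections import Counter

PREFIX_LEN = 5  # fixed prefix length stated in the abstract


def tokenize(text):
    """Lowercase and split on runs of Unicode letters (no language-specific rules)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def candidates(tokens, kind):
    """Return candidate terms of one kind: 'word', 'multiword' or 'prefix'."""
    if kind == "word":
        return list(tokens)
    if kind == "multiword":
        # Assumption: every contiguous 2-word sequence counts as a multi-word.
        return [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    if kind == "prefix":
        # Words shorter than PREFIX_LEN are kept whole.
        return [t[:PREFIX_LEN] for t in tokens]
    raise ValueError(f"unknown candidate kind: {kind}")


def rank_terms(corpus, doc_index, kind="word", top=5):
    """Rank one document's candidates by Tf-Idf (an assumed measure)."""
    term_lists = [candidates(tokenize(doc), kind) for doc in corpus]

    # Document frequency: in how many documents each candidate occurs.
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))

    target = term_lists[doc_index]
    counts = Counter(target)
    n_docs = len(corpus)
    scores = {
        term: (count / len(target)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]
```

Calling `rank_terms(corpus, 0, kind="prefix")` on a list of document strings returns the highest-scoring 5-character prefixes of the first document. Because prefixes merge inflected forms such as "keyword" and "keywords" without a language-specific stemmer, they fit the language-independent setting the abstract describes.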

Author information

Authors and Affiliations

CA3-Uninova, FCT, Universidade Nova de Lisboa, 2829-516, Caparica, Portugal
Luís Teixeira & Rita A. Ribeiro
DI-FCT/UNL, 2829-516, Caparica, Portugal
Gabriel Lopes

Editor information

Editors and Affiliations

Faculty of Sciences and Technology, New University of Lisbon, Campus de Caparica, 2829-516, Monte, Caparica, Portugal
Luis M. Camarinha-Matos

Cite this paper

Teixeira, L., Lopes, G., Ribeiro, R.A. (2011). Automatic Extraction of Document Topics. In: Camarinha-Matos, L.M. (eds) Technological Innovation for Sustainability. DoCEIS 2011. IFIP Advances in Information and Communication Technology, vol 349. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19170-1_11

DOI: https://doi.org/10.1007/978-3-642-19170-1_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19169-5
Online ISBN: 978-3-642-19170-1
eBook Packages: Computer Science, Computer Science (R0)
