Abstract
We present a framework for keyphrase extraction from scientific journals in diverse research fields. While journal articles are often provided with manually assigned keywords, it is not clear how to automatically extract keywords and measure their significance for a set of journal articles. We compare extracted keyphrases from journals in the fields of astrophysics, mathematics, physics, and computer science. We show that the presented statistics-based framework is able to demonstrate differences among journals, and that the extracted keyphrases can be used to represent journal or conference research topics, dynamics, and specificity.
Notes
1. See demo on-line: http://textmining.lt:8080/tex2txt.htm.
2. In our case, this is mathematical language. Other cases may include a mix of English and French paragraphs in the same article.
3. Function words are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence (https://en.wikipedia.org/wiki/Function_word). For instance, "and", "or", "the", and "a" are all function words.
© 2016 Springer International Publishing AG
Cite this paper
Daudaravicius, V. (2016). A Framework for Keyphrase Extraction from Scientific Journals. In: González-Beltrán, A., Osborne, F., Peroni, S. (eds) Semantics, Analytics, Visualization. Enhancing Scholarly Data. SAVE-SD 2016. Lecture Notes in Computer Science, vol 9792. Springer, Cham. https://doi.org/10.1007/978-3-319-53637-8_7
DOI: https://doi.org/10.1007/978-3-319-53637-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53636-1
Online ISBN: 978-3-319-53637-8
eBook Packages: Computer Science (R0)