Abstract
This paper presents a dataset and supervised learning experiments for term extraction from Slovene academic texts. Term candidates in the dataset were extracted via morphosyntactic patterns and annotated for their termness by four annotators. Experiments on the dataset show that most co-occurrence statistics, when applied after morphosyntactic pattern filtering and a frequency threshold, perform close to random, and that results improve significantly when all seven statistical measures included in the dataset are combined through supervised machine learning. On multi-word terms, the model using all statistics obtains an AUC of 0.736, while the best single statistic reaches an AUC of only 0.590. Among many additional candidate features, only multi-word morphosyntactic pattern information and the length of single-word term candidates yield further improvements.
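The comparison described in the abstract, ranking candidates by one statistic versus by a supervised combination of seven, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic (not the KAS-term dataset), the seven features are independent noisy signals standing in for the co-occurrence statistics, and logistic regression is used only as a simple stand-in, since the abstract does not name the learner. The AUC values it prints say nothing about the paper's reported results.

```python
import math
import random


def auc(scores, labels):
    """Probability that a random term outranks a random non-term (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict(w, b, x):
    """Logistic model output for one feature vector."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_b += err
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Synthetic stand-in data (assumption): 600 candidates, label 1 = term,
# seven weak features that each carry a little signal plus Gaussian noise.
rng = random.Random(0)
N_STATS = 7
labels = [rng.randint(0, 1) for _ in range(600)]
features = [[0.5 * y + rng.gauss(0.0, 1.0) for _ in range(N_STATS)] for y in labels]

# Hold out the second half for evaluation.
train_X, test_X = features[:300], features[300:]
train_y, test_y = labels[:300], labels[300:]

# AUC of ranking the held-out candidates by each single feature.
single_aucs = [auc([x[j] for x in test_X], test_y) for j in range(N_STATS)]

# AUC of ranking by the model that combines all seven features.
w, b = train_logreg(train_X, train_y)
combined_auc = auc([predict(w, b, x) for x in test_X], test_y)

print("best single-feature AUC:", round(max(single_aucs), 3))
print("combined-model AUC:", round(combined_auc, 3))
```

AUC is used here because it evaluates the whole ranking of candidates rather than a single decision threshold, with 0.5 corresponding to a random ordering.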
Notes
1. The complete KAS corpus and the KAS-Dr corpus of PhDs are available for exploring through the CLARIN.SI concordancers, http://www.clarin.si/info/concordances/.
Acknowledgements
The work described in this paper was funded by the Slovenian Research Agency within the national basic research project “Slovene scientific texts: resources and description” (J6-7094, 2016–2019).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Ljubešić, N., Fišer, D., Erjavec, T. (2019). KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_10
DOI: https://doi.org/10.1007/978-3-030-27947-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science, Computer Science (R0)