KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning

Ljubešić, Nikola; Fišer, Darja; Erjavec, Tomaž

doi:10.1007/978-3-030-27947-9_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

This paper presents a dataset and supervised learning experiments for term extraction from Slovene academic texts. Term candidates in the dataset were extracted via morphosyntactic patterns and annotated for their termness by four annotators. Experiments on the dataset show that most co-occurrence statistics, applied after morphosyntactic patterns and a frequency threshold, perform close to random, and that results improve significantly when all seven statistical measures included in the dataset are combined through supervised machine learning. On multi-word terms, the model using all statistics obtains an AUC of 0.736, while the best single statistic reaches an AUC of only 0.590. Among the many additional candidate features, only the morphosyntactic pattern of multi-word candidates and the length of single-word candidates improve the results further.
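
The comparison described in the abstract, ranking candidates by a single statistic versus combining all statistics in a supervised model, can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (not the KAS-term dataset), the seven features are generic noisy signals rather than the actual co-occurrence statistics, logistic regression is chosen for simplicity and is not confirmed by the abstract as the authors' classifier, and all function names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def make_synthetic_candidates(n=2000, n_stats=7, seed=0):
    """Synthetic stand-in for annotated term candidates (NOT the KAS-term data)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)  # 1 = term, 0 = non-term (assumed labels)
    # Each hypothetical statistic is a weak, noisy signal of termness.
    X = rng.normal(size=(n, n_stats)) + 0.3 * y[:, None]
    return X, y


def single_statistic_aucs(X, y):
    """AUC obtained by ranking candidates on each statistic alone."""
    return [roc_auc_score(y, X[:, j]) for j in range(X.shape[1])]


def combined_auc(X, y, folds=5):
    """Cross-validated AUC of a classifier that combines all statistics."""
    clf = LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities, so the AUC is not computed on training data.
    proba = cross_val_predict(clf, X, y, cv=folds, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


X, y = make_synthetic_candidates()
single_aucs = single_statistic_aucs(X, y)
best_single = max(single_aucs)
combined = combined_auc(X, y)
print(f"best single statistic AUC: {best_single:.3f}")
print(f"combined model AUC:        {combined:.3f}")
```

On real data, `X` would hold the statistics shipped with the dataset and `y` the annotators' termness judgements; the numbers printed here say nothing about the published results.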

Notes

1. The complete KAS corpus and the KAS-Dr corpus of PhDs are available for exploring through the CLARIN.SI concordancers, http://www.clarin.si/info/concordances/.
2. http://hdl.handle.net/11356/1198.

Acknowledgements

The work described in this paper was funded by the Slovenian Research Agency within the national basic research project “Slovene scientific texts: resources and description” (J6-7094, 2016–2019).

Author information

Authors and Affiliations

Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Nikola Ljubešić, Darja Fišer & Tomaž Erjavec
Department of Translation, Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia
Darja Fišer

Corresponding author

Correspondence to Nikola Ljubešić.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein

About this paper

Cite this paper

Ljubešić, N., Fišer, D., Erjavec, T. (2019). KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_10

DOI: https://doi.org/10.1007/978-3-030-27947-9_10
Published: 06 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science, Computer Science (R0)
