KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning

  • Conference paper
Text, Speech, and Dialogue (TSD 2019)

Abstract

This paper presents a dataset and supervised learning experiments for term extraction from Slovene academic texts. Term candidates in the dataset were extracted with morphosyntactic patterns and annotated for termness by four annotators. Experiments on the dataset show that most co-occurrence statistics, applied after the morphosyntactic patterns and a frequency threshold, perform close to random, and that results improve significantly when all seven statistical measures included in the dataset are combined through supervised machine learning. On multi-word terms, the model using all statistics obtains an AUC of 0.736, whereas the best single statistic reaches an AUC of only 0.590. Among many additional candidate features, only the morphosyntactic pattern of multi-word candidates and the length of single-word candidates yield further improvements.
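The AUC figures above can be read as the probability that a randomly chosen term is scored higher than a randomly chosen non-term, so 0.5 corresponds to random ranking. The following is a minimal sketch of that computation, not the authors' evaluation code: the function name `auc`, the two "statistics" and the toy labels are all hypothetical and invented for illustration, and the plain average is only a crude stand-in for a trained classifier.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    Returns the fraction of (term, non-term) pairs in which the term has
    the higher score; tied pairs count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one term and one non-term")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy, made-up candidates (NOT taken from KAS-term): 1 = term, 0 = non-term.
labels = [1, 1, 1, 0, 0, 0]
stat_a = [0.9, 0.2, 0.6, 0.5, 0.1, 0.7]  # hypothetical statistic A
stat_b = [0.3, 0.8, 0.7, 0.6, 0.2, 0.1]  # hypothetical statistic B

# Crude combination by averaging; an assumption standing in for a learned model.
combined = [(a + b) / 2 for a, b in zip(stat_a, stat_b)]

for name, scores in [("A", stat_a), ("B", stat_b), ("A+B", combined)]:
    print(name, round(auc(labels, scores), 3))
```

The pairwise loop is quadratic in the number of candidates, which is fine for a demonstration; for larger candidate lists a rank-based implementation, such as the one in scikit-learn, would be the usual choice.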

Notes

  1. The complete KAS corpus and the KAS-Dr corpus of PhDs are available for exploring through the CLARIN.SI concordancers, http://www.clarin.si/info/concordances/.

  2. http://hdl.handle.net/11356/1198.

Acknowledgements

The work described in this paper was funded by the Slovenian Research Agency within the national basic research project “Slovene scientific texts: resources and description” (J6-7094, 2016–2019).

Corresponding author

Correspondence to Nikola Ljubešić.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Ljubešić, N., Fišer, D., Erjavec, T. (2019). KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_10

  • DOI: https://doi.org/10.1007/978-3-030-27947-9_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science, Computer Science (R0)
