Automatic Identification of Academic Phrases for Czech

Kováříková, Dominika; Kovářík, Oleg

doi:10.1007/978-3-030-30135-4_17

Conference paper
First Online: 18 September 2019

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11755)

Abstract

The aim of this study is to automatically extract academic phrases in Czech using data-mining techniques, as a first step towards creating a dictionary of academic words and phrases for university-level students (L1 and L2). Data mining was chosen because of its excellent results in the automatic recognition of single-word and multi-word terms [10]. The method identified various types of academic phrases: structurally incomplete lexical bundles with specific functions in texts (e.g. na druhou stranu – on the other hand), collocations (e.g. podrobná analýza – detailed analysis), and combinations of a content word and a typical function word (e.g. zaměřený na – focused on; podobný jako – similar to). The final list of automatically identified academic phrases is extensive, consisting of 7,300 bigrams. Manual evaluation of a sample of the output showed that the precision of the automatic identification method is more than 72% and its recall is 81%. The list is a very good starting point for the planned dictionary because the majority of the extracted bigrams are collocations typical of academic texts. Such collocations are useful for the target audience, that is, university students interested in academic writing.

This paper has been, in part, funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures (Czech National Corpus project, LM2015044). It was also supported by the European Regional Development Fund-Project “Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World” (No. CZ.02.1.01/0.0/0.0/16_019/0000734).

Notes

1. Data mining is a discipline that partially overlaps with machine learning. In this study, we use data mining terminology because we are searching for useful information in vast amounts of language data. However, we recognize that this terminological preference is a matter of point of view.
2. By collocation we mean a meaningful combination of frequently co-occurring words, cf. e.g. McEnery and Hardie [13].
3. As these characteristics are mostly based on frequency or distribution, it can be assumed that similar values will allow automatic identification of academic phrases in other languages, e.g. English.
4. We only use lemmas, because they proved to be more effective in previous research [10]. This is due to the rich morphology of Czech.
5. We used J48 decision tree models. These were selected as the most suitable method for automatic extraction of terms and non-terms in Kováříková [10]. Different pre-processing was chosen for each model: (1) unbalanced classes (phrases and non-phrases), (2) class balancer, and (3) resampling.
6. With the exception of a study that launched our interest in an academic phrase list [11].
7. Currently, there is only one such list for Czech, Akalex, which is limited in terms of size and completeness, cf. www.korpus.cz/akalex [2].
8. Precision is the fraction of retrieved instances that are relevant; recall is the fraction of all relevant instances that have been retrieved.
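
The definitions of precision and recall in note 8 can be illustrated with a minimal Python sketch. The function name precision_recall and the toy sets below are assumptions made for illustration only (bigram_x, bigram_y and bigram_z are placeholders); they are not the evaluation data or code used in the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    # Items that were both retrieved and judged relevant.
    true_positives = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data (hypothetical): bigrams an automatic method marked as academic phrases ...
retrieved = {"podrobná analýza", "zaměřený na", "podobný jako", "bigram_x"}
# ... and bigrams a human annotator judged to be academic phrases.
relevant = {"podrobná analýza", "zaměřený na", "podobný jako", "bigram_y", "bigram_z"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# prints "precision = 0.75, recall = 0.60"
```

In this toy case three of the four retrieved bigrams are relevant (precision 0.75), and three of the five relevant bigrams were retrieved (recall 0.60).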

References

1. Ackermann, K., Chen, Y.-H.: Developing the academic collocation list (ACL) – a corpus-driven and expert-judged approach. J. Engl. Acad. Purp. 12(4), 235–247 (2013)
2. Akalex 2018: Lexikon akademické češtiny. Akalex 2018: A Lexicon of Academic Czech (in Czech) (2018). https://korpus.cz/akalex. Accessed 15 May 2019
3. Biber, D., Barbieri, F.: Lexical bundles in university spoken and written registers. Engl. Specif. Purp. 26, 263–286 (2007)
4. Chen, Y.-H., Baker, P.: Lexical bundles in L1 and L2 student writing. Lang. Learn. Technol. 14, 30–49 (2010)
5. Coxhead, A.: A new academic word list. TESOL Q. 34(2), 213–238 (2000)
6. Durrant, P.: Investigating the viability of a collocation list for students of English for academic purposes. Engl. Specif. Purp. 28, 157–169 (2009)
7. Frank, E., Hall, M.A., Witten, I.H.: The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques, 4th edn. Morgan Kaufmann, Burlington (2016)
8. Granger, S.: Academic phraseology: a key ingredient in successful L2 academic literacy. Oslo Stud. Engl. 9(3), 9–27 (2017)
9. Hyland, K.: Bundles in academic discourse. Ann. Rev. Appl. Linguist. 32, 150–169 (2012)
10. Kováříková, D.: Kvantitativní charakteristiky termínů. Quantitative Characteristics of Terms (in Czech). LN, Praha (2017)
11. Kováříková, D., Lukešová, L.: Extracting multi-word expressions for the Czech academic phrase list (conference presentation)
12. Křen, M., et al.: SYN2015: Representative Corpus of Written Czech. Institute of the Czech National Corpus, FFUK, Prague (2015). http://www.korpus.cz. Accessed 15 May 2019
13. McEnery, T., Hardie, A.: Corpus Linguistics: Method, Theory and Practice. John Benjamins, Amsterdam (2012)
14. Simpson-Vlach, R., Ellis, N.: An academic formulas list: new methods in phraseology research. Appl. Linguist. 31(4), 487–512 (2010)
15. Vincent, B.: Investigating academic phraseology through combinations of very frequent words: a methodological exploration. J. Engl. Acad. Purp. 12, 44–56 (2013)
16. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, Amsterdam (2005)

Author information

Authors and Affiliations

Institute of the Czech National Corpus, Charles University, Prague, Czech Republic
Dominika Kováříková
Datamole, Banskobystrická 2080/11, 160 00, Prague, Czech Republic
Oleg Kovářík

Corresponding authors

Correspondence to Dominika Kováříková or Oleg Kovářík.

Editor information

Editors and Affiliations

University of Malaga, Malaga, Spain
Gloria Corpas Pastor
University of Wolverhampton, Wolverhampton, UK
Ruslan Mitkov

About this paper

Cite this paper

Kováříková, D., Kovářík, O. (2019). Automatic Identification of Academic Phrases for Czech. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2019. Lecture Notes in Computer Science, vol 11755. Springer, Cham. https://doi.org/10.1007/978-3-030-30135-4_17

DOI: https://doi.org/10.1007/978-3-030-30135-4_17
Published: 18 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30134-7
Online ISBN: 978-3-030-30135-4
eBook Packages: Computer Science, Computer Science (R0)
