Automatic Identification of Academic Phrases for Czech

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11755)

Abstract

The aim of this study is to automatically extract academic phrases in Czech using data-mining techniques as a first step towards creating a dictionary of academic words and phrases for university-level students (L1 and L2). The decision to use data mining was based on its excellent results in the automatic recognition of single-word and multi-word terms [10]. The method identified various types of academic phrases: structurally incomplete lexical bundles with specific functions in texts (e.g. na druhou stranu - on the other hand), collocations (e.g. podrobná analýza - detailed analysis), and combinations of a content word and a typical function word (e.g. zaměřený na - focused on; podobný jako - similar to). The final list of automatically identified academic phrases is extensive and consists of 7,300 bigrams. Manual evaluation of a sample of the output showed that the precision of the automatic identification method is more than 72% and its recall is 81%. The list is a very good starting point for the planned dictionary because the majority of the extracted bigrams are collocations typical of academic texts. Such collocations are useful for the target audience, that is, university students interested in academic writing.
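As a rough illustration of the kind of input such an extraction works with, the sketch below counts lemma bigrams and records how many documents each one occurs in. It is a minimal sketch under assumptions, not the authors' pipeline: the function `bigram_stats`, the toy documents, and the choice of these two statistics are hypothetical. The abstract only states that the candidates are bigrams, and the notes only state that lemmas are used and that the characteristics are mostly based on frequency or distribution.

```python
from collections import Counter


def bigram_stats(documents):
    """Count lemma bigrams and the number of documents each occurs in.

    `documents` is a list of documents, each given as a list of lemmas.
    """
    frequency = Counter()       # total occurrences of each bigram
    document_count = Counter()  # number of documents containing the bigram
    for lemmas in documents:
        bigrams = list(zip(lemmas, lemmas[1:]))
        frequency.update(bigrams)
        # A set ensures each bigram is counted once per document.
        document_count.update(set(bigrams))
    return frequency, document_count


# Hypothetical, already lemmatized toy documents (not from the corpus).
documents = [
    ["podrobný", "analýza", "být", "zaměřený", "na", "text"],
    ["studie", "být", "zaměřený", "na", "podrobný", "analýza"],
    ["podrobný", "analýza", "a", "podrobný", "analýza"],
]

frequency, document_count = bigram_stats(documents)
for bigram in [("podrobný", "analýza"), ("zaměřený", "na")]:
    print(bigram, frequency[bigram], document_count[bigram])
```

Statistics of this kind could serve as features for a classifier that separates phrases from non-phrases; which features the study actually used is not specified on this page.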

This paper has been, in part, funded by the Ministry of Education, Youth and Sports of the Czech Republic within the framework of Large Research, Development and Innovation Infrastructures (Czech National Corpus project, LM2015044). It was also supported by the European Regional Development Fund-Project “Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World” (No. CZ.02.1.01/0.0/0.0/16_019/0000734).

Notes

  1. Data mining is a discipline that partially overlaps with machine learning. In this study, we use data mining terminology because we are searching for useful information in vast amounts of language data. However, we recognize that this terminological preference is a matter of perspective.

  2. By collocation we mean a meaningful combination of frequently co-occurring words; cf. McEnery and Hardie [13].

  3. As these characteristics are mostly based on frequency or distribution, it can be assumed that similar values will allow automatic identification of academic phrases in other languages, e.g. English.

  4. We only use lemmas because they proved more effective in previous research [10], owing to the rich morphology of Czech.

  5. We used J48 decision tree models, which were selected as the most suitable method for automatic extraction of terms and non-terms in Kováříková [10]. Different pre-processing was chosen for each model: (1) unbalanced classes (phrases and non-phrases), (2) class balancer, and (3) resampling.

  6. With the exception of a study that launched our interest in an academic phrase list [11].

  7. Currently, there is only one such list for the Czech language, Akalex, which is limited in size and completeness; cf. www.korpus.cz/akalex [2].

  8. Precision is the fraction of relevant instances among the retrieved instances; recall is the fraction of all relevant instances that have been retrieved.
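The definitions of precision and recall given in the last note can be illustrated with a short sketch. The function name and the toy sets of bigrams below are hypothetical and are not taken from the study's data; they only show how the two fractions are computed from a set of retrieved items and a set of relevant items.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data (not from the study): bigrams returned by a
# classifier versus bigrams that annotators marked as academic phrases.
retrieved = {"podrobná analýza", "zaměřený na", "podobný jako", "velký dům"}
relevant = {"podrobná analýza", "zaměřený na", "podobný jako", "na základě", "v rámci"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# prints "precision = 0.75, recall = 0.60"
```

Here three of the four retrieved bigrams are relevant (precision 3/4), and three of the five relevant bigrams were retrieved (recall 3/5).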

References

  1. Ackermann, K., Chen, Y.-H.: Developing the academic collocation list (ACL) – a corpus-driven and expert-judged approach. J. Engl. Acad. Purp. 12(4), 235–247 (2013)

  2. Akalex 2018: Lexikon akademické češtiny. Akalex 2018: A Lexicon of Academic Czech (in Czech) (2018). https://korpus.cz/akalex. Accessed 15 May 2019

  3. Biber, D., Barbieri, F.: Lexical bundles in university spoken and written registers. Engl. Specif. Purp. 26, 263–286 (2007)

  4. Chen, Y.-H., Baker, P.: Lexical bundles in L1 and L2 student writing. Lang. Learn. Technol. 14, 30–49 (2010)

  5. Coxhead, A.: A new academic word list. TESOL Q. 34(2), 213–238 (2000)

  6. Durrant, P.: Investigating the viability of a collocation list for students of English for academic purposes. Engl. Specif. Purp. 28, 157–169 (2009)

  7. Frank, E., Hall, M.A., Witten, I.H.: The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques, 4th edn. Morgan Kaufmann, Burlington (2016)

  8. Granger, S.: Academic phraseology: a key ingredient in successful L2 academic literacy. Oslo Stud. Engl. 9(3), 9–27 (2017)

  9. Hyland, K.: Bundles in academic discourse. Ann. Rev. Appl. Linguist. 32, 150–169 (2012)

  10. Kováříková, D.: Kvantitativní charakteristiky termínů. Quantitative Characteristics of Terms (in Czech). LN, Praha (2017)

  11. Kováříková, D., Lukešová, L.: Extracting multi-word expressions for the Czech academic phrase list (conference presentation)

  12. Křen, M., et al.: SYN2015: Representative Corpus of Written Czech. Institute of the Czech National Corpus, FFUK, Prague (2015). http://www.korpus.cz. Accessed 15 May 2019

  13. McEnery, T., Hardie, A.: Corpus Linguistics: Method, Theory and Practice. John Benjamins, Amsterdam (2012)

  14. Simpson-Vlach, R., Ellis, N.: An academic formulas list: new methods in phraseology research. Appl. Linguist. 31(4), 487–512 (2010)

  15. Vincent, B.: Investigating academic phraseology through combinations of very frequent words: a methodological exploration. J. Engl. Acad. Purp. 12, 44–56 (2013)

  16. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, Amsterdam (2005)

Corresponding authors

Correspondence to Dominika Kováříková or Oleg Kovářík.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kováříková, D., Kovářík, O. (2019). Automatic Identification of Academic Phrases for Czech. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2019. Lecture Notes in Computer Science, vol 11755. Springer, Cham. https://doi.org/10.1007/978-3-030-30135-4_17

  • DOI: https://doi.org/10.1007/978-3-030-30135-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30134-7

  • Online ISBN: 978-3-030-30135-4

  • eBook Packages: Computer Science, Computer Science (R0)
