Eye of a Needle in a Haystack

Multiword Expressions in Czech: Typology and Lexicon

  • Conference paper
Computational and Corpus-Based Phraseology (EUROPHRAS 2017)

Abstract

We propose a multidimensional taxonomy of multiword expressions (MWEs) as a pattern applicable to entries in a representative lexicon of Czech MWEs. The taxonomy and the lexicon are useful for lexicography, for teaching Czech as a foreign language, and for theoretical work on MWEs as entities standing between lexicon and grammar, as well as for NLP tasks such as tagging and parsing, identification and search of MWEs, and word sense and semantic disambiguation. In addition to describing various types of idiomaticity, the taxonomy and the lexicon are designed to account for flexibility in morphology and word order, syntactic and lexical variants, and even creatively used fragments.

This paper is part of the project Between Lexicon and Grammar (2016–2018), supported by the Grant Agency of the Czech Republic, reg. no. 16-07473S.

Notes

  1. Even velbloud ‘camel’ is sometimes missing. Czech is no exception, cf. a similarly creative use in English: You’d have an easier time getting a camel through the eye of a needle than getting them to agree on the issue (Farlex Dictionary of Idioms, 2015).

  2. SYN2010, SYN2015 (122 million tokens each), SYN release 4 and 5 (4.3 and 4.6 billion tokens, respectively).

  3. Valency is a notable exception to this principle, cf. Sect. 5.4.

  4. For more cf. https://typo.uni-konstanz.de/parseme/index.php/results/papers.

  5. Our taxonomy treats collocations as “statistically idiomatic MWEs” (see Sect. 7.6).

  6. Seretan uses the term collocation in the linguistic, syntactically motivated sense. For the broader, statistically defined class she adopts the term co-occurrence.

  7. Cf. http://typo.uni-konstanz.de/parseme.

  8. Some of the types below can be binomials, i.e. expressions containing two words that are juxtaposed, or joined by a conjunction (usually and or or) or a preposition: day and night, G. gang und gäbe ‘usual’, Cz. [děvče] krev a mlíko ‘a ruddy, healthy-looking, full-figured [girl]’ (lit. ‘[a girl] blood and milk’).

  9. In prepositional phrases (and in some other kinds of phrases), two syntactic heads are distinguished: (i) the surface syntactic head and (ii) the deep syntactic head, constituted by the preposition and the NP’s head noun, respectively.

  10. Currently, we use VALLEX [16]. We plan to add entries and/or more information from other electronically available valency lexicons of Czech, such as PDT-VALLEX, and use MWE-related information available in some valency lexicons, cf. [28] and [23].

  11. Like traditional Czech grammars, we see topicalization as a word-order phenomenon rather than a transformation.

  12. The structure of the lexical entry is based on the principles of structured lexical description proposed in [29], though in a substantially simplified way.

References

  1. Baldwin, T., Kim, S.N.: Multiword expressions. In: Indurkhya, N., Damerau, F.J. (eds.) Handbook of Natural Language Processing, 2nd edn., pp. 267–292. CRC Press, Boca Raton (2010)

  2. Barnbrook, G., Mason, O., Krishnamurthy, R.: Collocation. Applications and Implications. Palgrave Macmillan UK, Basingstoke (2013)

  3. Bejček, E., Hajič, J., Straňák, P., Urešová, Z.: Extracting verbal multiword data from rich treebank annotation. In: Proceedings of the 15th International Workshop on Treebanks and Linguistic Theories (TLT 2015), pp. 13–24. Indiana University, Bloomington (2017)

  4. Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N.R. (eds.): Phraseology: An International Handbook of Contemporary Research. Walter de Gruyter, Berlin, New York (2007)

  5. Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N.R.: Phraseology: subject area, terminology and research topic. In: Burger et al. [4], pp. 10–19

  6. Cvrček, V.: Kvantitativní analýza kontextu. Nakladatelství Lidové noviny, Prague (2014)

  7. Dobrovol’skij, D., Filipenko, T.: Russian phraseology. In: Burger et al. [4], pp. 714–727

  8. Čermák, F.: Czech and General Phraseology. Karolinum, Prague (2007)

  9. Čermák, F.: Lexikon a sémantika. Nakladatelství Lidové noviny, Prague (2010)

  10. Čermák, F.: Frazeologie a idiomatika: Jejich podstata a proměnlivost názorů na ně. Časopis pro moderní filologii 98(2), 199–217 (2016)

  11. Čermák, F., et al.: Slovník české frazeologie a idiomatiky (SČFI), vol. 1–4. Academia/Leda, Prague (1983–2009)

  12. Evert, S.: The statistics of word cooccurrences: word pairs and collocations. Ph.D. thesis, IMS, University of Stuttgart, Stuttgart (2004). http://www.collocations.de

  13. Hnátková, M.: Značkování frazémů a idiomů v Českém národním korpusu s pomocí slovníku české frazeologie a idiomatiky. Slovo a slovesnost 63(2), 117–126 (2002)

  14. Klégr, A.: Lexikální kolokace: základní přehled o vývoji pojetí. Časopis pro moderní filologii 98(1), 95–103 (2016)

  15. Lauriston, A.: Criteria for measuring term recognition. In: Proceedings of the Seventh Conference on European Chapter of the Association for Computational Linguistics, EACL 1995, pp. 17–22. Morgan Kaufmann Publishers, San Francisco (1995)

  16. Lopatková, M., Kettnerová, V., Bejček, E., Skwarska, K., Žabokrtský, Z.: VALLEX 2.6.3 - Valency Lexicon of Czech Verbs. Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University (2014)

  17. Martins, A., Almeida, M., Smith, N.A.: Turning on the turbo: fast third-order non-projective turbo parsers. In: Annual Meeting of the Association for Computational Linguistics - ACL, pp. 617–622 (2013)

  18. Mel’čuk, I.: Collocations: définition, rôle et utilité. In: Grossmann, F., Tutin, A. (eds.) Les collocations: analyse et traitement, pp. 23–32. De Werelt, Amsterdam (2003)

  19. Mieder, W.: Proverbs Are Never Out of Season: Popular Wisdom in the Modern Age. Peter Lang, New York (2012)

  20. Moon, R.: Corpus linguistic approaches with English corpora. In: Burger et al. [4], pp. 1045–1059

  21. Nunberg, G., Sag, I.A., Wasow, T.: Idioms. Language 70(3), 491–538 (1994)

  22. Pecina, P.: Lexical association measures and collocation extraction. Lang. Resour. Eval. 44(1–2), 137–158 (2010)

  23. Przepiórkowski, A., Hajič, J., Hajnicz, E., Urešová, Z.: Phraseology in two Slavic valency dictionaries: limitations and perspectives. Int. J. Lexicogr. 30(1), 1–38 (2017)

  24. Richter, F., Sailer, M.: Idiome mit phraseologisierten Teilsätzen: Eine Fallstudie zur Formalisierung von Konstruktionen im Rahmen der HPSG. In: Lasch, A., Ziem, A. (eds.) Grammatik als Netzwerk von Konstruktionen, pp. 291–312. de Gruyter, Berlin (2014)

  25. Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 1–15. Springer, Heidelberg (2002). doi:10.1007/3-540-45715-1_1

  26. Seretan, V.: Syntax-Based Collocation Extraction. Text, Speech and Language Technology, vol. 44. Springer, Dordrecht (2011). doi:10.1007/978-94-007-0134-2

  27. Sinclair, J.M.: Corpus, Concordance, Collocation. Oxford University Press, Oxford (1991)

  28. Urešová, Z.: Building the PDT-VALLEX valency lexicon. In: On-line Proceedings of the Fifth Corpus Linguistics Conference. University of Liverpool (2009)

  29. Vondřička, P.: Formalized contrastive lexical description: a framework for bilingual dictionaries. LINCOM GmbH, München (2014)

Author information

Correspondence to Hana Skoumalová.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Hnátková, M. et al. (2017). Eye of a Needle in a Haystack. In: Mitkov, R. (ed.) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_12

  • DOI: https://doi.org/10.1007/978-3-319-69805-2_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69804-5

  • Online ISBN: 978-3-319-69805-2

  • eBook Packages: Computer Science, Computer Science (R0)
