Abstract
We propose a multidimensional taxonomy of multiword expressions (MWEs) as a pattern for entries in a representative lexicon of Czech MWEs. The taxonomy and the lexicon serve lexicography, the teaching of Czech as a foreign language, and theoretical research on MWEs as entities standing between lexicon and grammar, as well as NLP tasks such as tagging and parsing, identification and search of MWEs, and word sense and semantic disambiguation. In addition to describing various types of idiomaticity, the taxonomy and the lexicon are designed to account for flexibility in morphology and word order, syntactic and lexical variants, and even creatively used fragments.
This paper is part of the project Between Lexicon and Grammar (2016–2018), supported by the Grant Agency of the Czech Republic, reg. no. 16-07473S.
Notes
- 1.
Even velbloud ‘camel’ is sometimes missing. Czech is no exception, cf. a similarly creative use in English: You’d have an easier time getting a camel through the eye of a needle than getting them to agree on the issue (Farlex Dictionary of Idioms, 2015).
- 2.
SYN2010, SYN2015 (122 million tokens each), SYN release 4 and 5 (4.3 and 4.6 billion tokens, respectively).
- 3.
Valency is a notable exception to this principle, cf. Sect. 5.4.
- 4.
- 5.
Our taxonomy treats collocations as “statistically idiomatic MWEs” (see Sect. 7.6).
- 6.
Seretan uses the term collocation in the linguistic, syntactically motivated sense. For the broader, statistically defined class she adopts the term co-occurrence.
- 7.
- 8.
Some of the types below can be binomials, i.e. expressions containing two words that are juxtaposed, or joined by a conjunction (usually and or or) or preposition: day and night, G. gang und gäbe ‘usual’, Cz. [děvče] krev a mlíko ‘a ruddy, healthy-looking, full-figured [girl]’ (lit. ‘[a girl] blood and milk’).
- 9.
In prepositional phrases (and in some other kinds of phrases), two syntactic heads are distinguished: (i) the surface syntactic head, constituted by the preposition, and (ii) the deep syntactic head, constituted by the NP’s head noun.
- 10.
- 11.
Like traditional Czech grammars, we treat topicalization as a word-order phenomenon rather than a transformation.
- 12.
The structure of the lexical entry is based on the principles of structured lexical description proposed in [29], though in a substantially simplified way.
References
Baldwin, T., Kim, S.N.: Multiword expressions. In: Indurkhya, N., Damerau, F.J. (eds.) Handbook of Natural Language Processing, 2nd edn., pp. 267–292. CRC Press, Boca Raton (2010)
Barnbrook, G., Mason, O., Krishnamurthy, R.: Collocation. Applications and Implications. Palgrave Macmillan UK, Basingstoke (2013)
Bejček, E., Hajič, J., Straňák, P., Urešová, Z.: Extracting verbal multiword data from rich treebank annotation. In: Proceedings of the 15th International Workshop on Treebanks and Linguistic Theories (TLT 2015), pp. 13–24. Indiana University, Bloomington (2017)
Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N.R. (eds.): Phraseology: An International Handbook of Contemporary Research. Walter de Gruyter, Berlin, New York (2007)
Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N.R.: Phraseology: subject area, terminology and research topic. In: Burger et al. [4], pp. 10–19
Cvrček, V.: Kvantitativní analýza kontextu. Nakladatelství Lidové noviny, Prague (2014)
Dobrovol’skij, D., Filipenko, T.: Russian phraseology. In: Burger et al. [4], pp. 714–727
Čermák, F.: Czech and General Phraseology. Karolinum, Prague (2007)
Čermák, F.: Lexikon a sémantika. Nakladatelství Lidové noviny, Prague (2010)
Čermák, F.: Frazeologie a idiomatika: Jejich podstata a proměnlivost názorů na ně. Časopis pro moderní filologii 98(2), 199–217 (2016)
Čermák, F., et al.: Slovník české frazeologie a idiomatiky (SČFI), vol. 1–4. Academia/Leda, Prague (1983–2009)
Evert, S.: The statistics of word cooccurrences: word pairs and collocations. Ph.D. thesis, IMS, University of Stuttgart, Stuttgart (2004). http://www.collocations.de
Hnátková, M.: Značkování frazémů a idiomů v Českém národním korpusu s pomocí slovníku české frazeologie a idiomatiky. Slovo a slovesnost 63(2), 117–126 (2002)
Klégr, A.: Lexikální kolokace: základní přehled o vývoji pojetí. Časopis pro moderní filologii 98(1), 95–103 (2016)
Lauriston, A.: Criteria for measuring term recognition. In: Proceedings of the Seventh Conference on European Chapter of the Association for Computational Linguistics, EACL 1995, pp. 17–22. Morgan Kaufmann Publishers, San Francisco (1995)
Lopatková, M., Kettnerová, V., Bejček, E., Skwarska, K., Žabokrtský, Z.: VALLEX 2.6.3 - Valency Lexicon of Czech Verbs. Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University (2014)
Martins, A., Almeida, M., Smith, N.A.: Turning on the turbo: fast third-order non-projective turbo parsers. In: Annual Meeting of the Association for Computational Linguistics - ACL, pp. 617–622 (2013)
Mel’čuk, I.: Collocations: définition, rôle et utilité. In: Grossmann, F., Tutin, A. (eds.) Les collocations: analyse et traitement, pp. 23–32. De Werelt, Amsterdam (2003)
Mieder, W.: Proverbs Are Never Out of Season: Popular Wisdom in the Modern Age. Peter Lang, New York (2012)
Moon, R.: Corpus linguistic approaches with English corpora. In: Burger et al. [4], pp. 1045–1059
Nunberg, G., Sag, I.A., Wasow, T.: Idioms. Language 70(3), 491–538 (1994)
Pecina, P.: Lexical association measures and collocation extraction. Lang. Resour. Eval. 44(1–2), 137–158 (2010)
Przepiórkowski, A., Hajič, J., Hajnicz, E., Urešová, Z.: Phraseology in two Slavic valency dictionaries: limitations and perspectives. Int. J. Lexicogr. 30(1), 1–38 (2017)
Richter, F., Sailer, M.: Idiome mit phraseologisierten Teilsätzen: Eine Fallstudie zur Formalisierung von Konstruktionen im Rahmen der HPSG. In: Lasch, A., Ziem, A. (eds.) Grammatik als Netzwerk von Konstruktionen, pp. 291–312. de Gruyter, Berlin (2014)
Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 1–15. Springer, Heidelberg (2002). doi:10.1007/3-540-45715-1_1
Seretan, V.: Syntax-Based Collocation Extraction. Text, Speech and Language Technology, vol. 44. Springer, Dordrecht (2011). doi:10.1007/978-94-007-0134-2
Sinclair, J.M.: Corpus, Concordance, Collocation. Oxford University Press, Oxford (1991)
Urešová, Z.: Building the PDT-VALLEX valency lexicon. In: On-line Proceedings of the Fifth Corpus Linguistics Conference. University of Liverpool (2009)
Vondřička, P.: Formalized contrastive lexical description: a framework for bilingual dictionaries. LINCOM GmbH, München (2014)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hnátková, M. et al. (2017). Eye of a Needle in a Haystack. In: Mitkov, R. (ed.) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_12
DOI: https://doi.org/10.1007/978-3-319-69805-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69804-5
Online ISBN: 978-3-319-69805-2
eBook Packages: Computer Science, Computer Science (R0)