Comparative Evaluation and Integration of Collocation Extraction Metrics

  • Victor Zakharov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)


The paper deals with collocation extraction from corpus data. Many formulae have been devised to integrate the different factors that determine the association between the components of a collocation. We describe experiments whose objective was to study collocation extraction based on statistical association measures. The work focuses on bigram collocations. The precision data obtained for the measures suggest that some measures are more precise than others. No measure is ideal, which is why various ways of integrating them are desirable and useful. We propose a number of parameters for ranking collocates in a combined list, namely an average rank, a normalized rank, and an optimized rank.
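To make the idea of integrating several measures concrete, the following is a minimal sketch of ranking collocates by the mean of their per-measure ranks. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not define the average, normalized, or optimized rank, so only a plain arithmetic mean of ranks is shown. The three measures (MI, t-score, logDice) are written in their commonly published forms, and the function names and all frequency counts are hypothetical.

```python
import math

# Association measures in their commonly published forms.
# f_xy: co-occurrence frequency, f_x: node frequency,
# f_y: collocate frequency, n: corpus size in tokens.

def mi(f_xy, f_x, f_y, n):
    """Mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """t-score: (observed - expected) scaled by sqrt(observed)."""
    return (f_xy - f_x * f_y / n) / math.sqrt(f_xy)

def log_dice(f_xy, f_x, f_y, n):
    """logDice: 14 + log2 of the Dice coefficient (n is unused)."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Assumption: which measures to combine is the user's choice.
MEASURES = {"MI": mi, "t-score": t_score, "logDice": log_dice}

def ranks_by_measure(candidates, f_node, n):
    """Return {measure: {collocate: rank}}, rank 1 = strongest association.

    candidates maps collocate -> (co-occurrence frequency, collocate frequency).
    """
    result = {}
    for name, fn in MEASURES.items():
        ordered = sorted(
            candidates,
            key=lambda c: fn(candidates[c][0], f_node, candidates[c][1], n),
            reverse=True,
        )
        result[name] = {c: r for r, c in enumerate(ordered, start=1)}
    return result

def average_rank(candidates, f_node, n):
    """Combine the per-measure lists by the arithmetic mean of ranks."""
    per_measure = ranks_by_measure(candidates, f_node, n)
    avg = {
        c: sum(ranks[c] for ranks in per_measure.values()) / len(per_measure)
        for c in candidates
    }
    # Lower mean rank = better position in the combined list.
    return sorted(avg.items(), key=lambda kv: kv[1])

if __name__ == "__main__":
    # Hypothetical counts for one node word, for illustration only.
    toy = {"strong": (50, 200), "rare": (5, 10), "frequent": (20, 5000)}
    for collocate, rank in average_rank(toy, f_node=1000, n=1_000_000):
        print(f"{collocate}\t{rank:.2f}")
```

Averaging ranks rather than raw scores avoids mixing the measures' incompatible scales. It weights every measure equally, which is one reason weighted or normalized variants might be preferred.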


Collocation extraction · Association measures · Evaluation · Ranking · Average rank · Normalized rank · Optimized rank



This work was partly supported by the grant of the Russian Foundation for Humanities (research project No. 16-04-12019).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Saint-Petersburg State University, Saint-Petersburg, Russia
