Czech Dataset for Semantic Textual Similarity

  • Lukáš Svoboda
  • Tomáš Brychcín
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)


Semantic textual similarity is the core shared task at the International Workshop on Semantic Evaluation (SemEval). It focuses on sentence meaning comparison. So far, most of the research has been devoted to English.

In this paper we present the first Czech dataset for semantic textual similarity. The dataset contains 1425 manually annotated sentence pairs. Czech is a highly inflected language and is considered challenging for many natural language processing tasks. The dataset is publicly available to the research community.

In 2016 we participated in the SemEval competition, where our UWB system ranked second among 113 submitted systems in the monolingual subtask and first among 26 systems in the cross-lingual subtask.

We adapt the UWB system, originally built for English, to Czech and experiment with the new Czech dataset. Our system achieves very promising results and can serve as a strong baseline for future research.


Keywords: Czech dataset, semantic textual similarity



This work was supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports and by Grant No. SGS-2016-018 Data and Software Engineering for Advanced Applications. Computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures”.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of West Bohemia, Pilsen, Czech Republic
