Czech Dataset for Semantic Textual Similarity
Semantic textual similarity is the core shared task at the International Workshop on Semantic Evaluation (SemEval). It focuses on sentence meaning comparison. So far, most of the research has been devoted to English.
In this paper we present the first Czech dataset for semantic textual similarity. The dataset contains 1425 manually annotated pairs. Czech is a highly inflected language and is considered challenging for many natural language processing tasks. The dataset is publicly available for the research community.
In 2016 we participated in the SemEval competition, and our UWB system was ranked second among 113 submitted systems in the monolingual subtask and first among 26 systems in the cross-lingual subtask.
We adapt the UWB system, originally developed for English, to Czech and experiment with the new Czech dataset. Our system achieves very promising results and can serve as a strong baseline for future research.
Keywords: Czech dataset, Semantic textual similarity
This work was supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports and by Grant No. SGS-2016-018 Data and Software Engineering for Advanced Applications. Computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures”.