Czech Dataset for Semantic Textual Similarity

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)

Abstract

Semantic textual similarity is the core shared task at the International Workshop on Semantic Evaluation (SemEval). It focuses on sentence meaning comparison. So far, most of the research has been devoted to English.

In this paper we present the first Czech dataset for semantic textual similarity. The dataset contains 1425 manually annotated pairs. Czech is a highly inflected language and is considered challenging for many natural language processing tasks. The dataset is publicly available to the research community.

In 2016 we participated in the SemEval competition, where our UWB system ranked second among 113 submitted systems in the monolingual subtask and first among 26 systems in the cross-lingual subtask.

We adapt the UWB system, originally designed for English, to Czech and experiment with the new Czech dataset. Our system achieves very promising results and can serve as a strong baseline for future research.

Notes

  1. A black and white dog looking at the camera.

  2. The black and white bull is looking at the camera.

Acknowledgements

This work was supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports and by Grant No. SGS-2016-018 Data and Software Engineering for Advanced Applications. Computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures”.

Author information

Correspondence to Lukáš Svoboda.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Svoboda, L., Brychcín, T. (2018). Czech Dataset for Semantic Textual Similarity. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_23

  • DOI: https://doi.org/10.1007/978-3-030-00794-2_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00793-5

  • Online ISBN: 978-3-030-00794-2

  • eBook Packages: Computer Science, Computer Science (R0)