Abstract
This study addresses the post-processing of a noisy collection of synsets created through crowdsourcing. First, we cluster long synsets in three different ways. Second, we apply four cluster cleaning techniques based either on word popularity or on word embeddings. Evaluation shows that the method based on word embeddings and existing dictionary definitions delivers the best results.
O. Antropova, E. Arslanova and M. Shaposhnikov contributed equally to the paper.
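The abstract does not specify how the embedding-based cleaning works, so the following is only a minimal sketch of one plausible reading: drop words whose vector lies far from the centroid of the cluster. The function names, the threshold value, and the toy two-dimensional vectors are all hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def clean_synset(words, vectors, threshold=0.5):
    """Keep words that are close to the cluster centroid.

    Hypothetical helper: words without a vector are kept unchanged,
    and the threshold is an arbitrary illustrative value.
    """
    known = [w for w in words if w in vectors]
    if not known:
        return list(words)
    dim = len(vectors[known[0]])
    centroid = [sum(vectors[w][i] for w in known) / len(known) for i in range(dim)]
    return [
        w for w in words
        if w not in vectors or cosine(vectors[w], centroid) >= threshold
    ]

# Toy vectors invented for illustration; real embeddings have hundreds of dimensions.
toy_vectors = {
    "car": [0.9, 0.1],
    "automobile": [0.8, 0.2],
    "auto": [0.85, 0.15],
    "banana": [-0.2, 0.9],
}
cleaned = clean_synset(["car", "automobile", "auto", "banana"], toy_vectors)
print(cleaned)
```

In this toy cluster the word "banana" points away from the other three vectors, so it falls below the threshold and is removed. A real pipeline would load pretrained vectors and tune the threshold on annotated data.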
Notes
- 1.
- 2.
- 3. Babenko, L.G.: The thesaurus dictionary of the Russian language synonyms, 2008.
- 4.
- 5. An overview of dictionary data available for Russian can be found in [8].
- 6.
- 7.
- 8. We use word2vec as the name of a general approach to word embeddings, in contrast to the latter method, which works with definitions.
References
Biemann, C.: Creating a system for lexical substitutions from scratch using crowdsourcing. Lang. Resour. Eval. 47(1), 97–122 (2013)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)
Braslavski, P., Ustalov, D., Mukhin, M., Kiselev, Y.: YARN: spinning-in-progress. In: GWC, pp. 58–65 (2016)
Braslavski, P., Ustalov, D., Mukhin, M.: A spinning wheel for YARN: user interface for a crowdsourced thesaurus. In: EACL (demo), pp. 101–104 (2014)
Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
Gurevych, I., Kim, J. (eds.): The People’s Web Meets NLP. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35085-6
Kiselev, Y., Ustalov, D., Porshnev, S.: Eliminating fuzzy duplicates in crowdsourced lexical resources. In: GWC, pp. 161–167 (2016)
Kiselev, Y., et al.: Russian lexicographic landscape: a tale of 12 dictionaries. In: Dialogue, pp. 254–271 (2015)
Kutuzov, A., Kuzmenko, E.: WebVectors: a toolkit for building web interfaces for vector semantic models. In: AIST, pp. 155–161 (2017)
Ustalov, D., Panchenko, A., Biemann, C.: Watset: automatic induction of synsets from a graph of synonyms. In: ACL, pp. 1579–1590 (2017)
Acknowledgments
PB was supported by RFH grant #16-04-12019; OA was supported by RFBR under research project No. 18-312-00129.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Antropova, O., Arslanova, E., Shaposhnikov, M., Braslavski, P., Mukhin, M. (2018). Cleaning Up After a Party: Post-processing Thesaurus Crowdsourced Data. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2018. Communications in Computer and Information Science, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-01204-5_13
DOI: https://doi.org/10.1007/978-3-030-01204-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01203-8
Online ISBN: 978-3-030-01204-5
eBook Packages: Computer Science, Computer Science (R0)