Cleaning Up After a Party: Post-processing Thesaurus Crowdsourced Data

Antropova, Oksana; Arslanova, Elena; Shaposhnikov, Maxim; Braslavski, Pavel; Mukhin, Mikhail

doi:10.1007/978-3-030-01204-5_13

Part of the book series: Communications in Computer and Information Science (CCIS, volume 930)

Included in the following conference series:

Conference on Artificial Intelligence and Natural Language

Abstract

The study deals with post-processing of a noisy collection of synsets created through crowdsourcing. First, we cluster long synsets in three different ways. Second, we apply four cluster cleaning techniques based on either word popularity or word embeddings. Evaluation shows that the method based on word embeddings and existing dictionary definitions delivers the best results.
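
The abstract does not spell out how the embedding-based cleaning works, so the following is only a minimal sketch of one common way to clean a cluster with word embeddings, not the authors' method. It assumes that a word is an outlier when its cosine similarity to the cluster centroid falls below a threshold. The function name `clean_synset`, the threshold value, and the toy three-dimensional vectors are all hypothetical; real vectors would come from a pretrained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def clean_synset(words, embeddings, threshold=0.5):
    """Split a synset into kept and removed words (hypothetical helper).

    A word is removed when its vector is less similar to the centroid of
    the synset than `threshold`. Words without a vector are kept, because
    there is no evidence against them.
    """
    known = [w for w in words if w in embeddings]
    if len(known) < 2:
        return list(words), []
    center = centroid([embeddings[w] for w in known])
    kept, removed = [], []
    for w in words:
        if w not in embeddings or cosine(embeddings[w], center) >= threshold:
            kept.append(w)
        else:
            removed.append(w)
    return kept, removed


# Toy vectors for illustration only (an assumption, not real model output).
toy_embeddings = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.8, 0.2, 0.0],
    "huge": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.1, 0.9],
}

kept, removed = clean_synset(["big", "large", "huge", "banana"], toy_embeddings)
print("kept:", kept)
print("removed:", removed)
```

With these toy vectors the three size adjectives stay in the synset and the unrelated word is removed. The threshold is a tuning parameter; a definition-based variant, as mentioned in the abstract, would presumably compare vectors built from dictionary definitions rather than from the words themselves.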

O. Antropova, E. Arslanova and M. Shaposhnikov contributed equally to the paper.

Notes

1. https://russianword.net/editor.
2. https://russianword.net/data.
3. Babenko, L.G.: The thesaurus dictionary of the Russian language synonyms, 2008.
4. https://ru.wiktionary.org/.
5. An overview of dictionary data available for Russian can be found in [8].
6. http://rusvectores.org/ru/models/.
7. https://radimrehurek.com/gensim/.
8. We use word2vec as the name of a general approach to word embeddings, in contrast to the latter method, which works with definitions.

References

1. Biemann, C.: Creating a system for lexical substitutions from scratch using crowdsourcing. Lang. Resour. Eval. 47(1), 97–122 (2013)
2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)
3. Braslavski, P., Ustalov, D., Mukhin, M., Kiselev, Y.: YARN: spinning-in-progress. In: GWC, pp. 58–65 (2016)
4. Braslavski, P., Ustalov, D., Mukhin, M.: A spinning wheel for YARN: user interface for a crowdsourced thesaurus. In: EACL (demo), pp. 101–104 (2014)
5. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
6. Gurevych, I., Kim, J. (eds.): The People’s Web Meets NLP. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35085-6
7. Kiselev, Y., Ustalov, D., Porshnev, S.: Eliminating fuzzy duplicates in crowdsourced lexical resources. In: GWC, pp. 161–167 (2016)
8. Kiselev, Y., et al.: Russian lexicographic landscape: a tale of 12 dictionaries. In: Dialogue, pp. 254–271 (2015)
9. Kutuzov, A., Kuzmenko, E.: Webvectors: a toolkit for building web interfaces for vector semantic models. In: AIST, pp. 155–161 (2017)
10. Ustalov, D., Panchenko, A., Biemann, C.: Watset: automatic induction of synsets from a graph of synonyms. In: ACL, pp. 1579–1590 (2017)

Acknowledgments

PB was supported by RFH grant #16-04-12019; OA was supported by RFBR under research project No. 18-312-00129.

Author information

Authors and Affiliations

Ural Federal University, Yekaterinburg, Russia
Oksana Antropova, Elena Arslanova, Maxim Shaposhnikov, Pavel Braslavski & Mikhail Mukhin
JetBrains Research, Saint Petersburg, Russia
Pavel Braslavski
National Research University Higher School of Economics, Saint Petersburg, Russia
Pavel Braslavski

Corresponding author

Correspondence to Oksana Antropova.

Editor information

Editors and Affiliations

Data and Web Science Group, University of Mannheim, Mannheim, Baden-Württemberg, Germany
Dmitry Ustalov
ITMO University, St. Petersburg, Russia
Andrey Filchenkov
University of Helsinki, Helsinki, Finland
Lidia Pivovarova
Mendel University, Brno, Czech Republic
Jan Žižka

About this paper

Cite this paper

Antropova, O., Arslanova, E., Shaposhnikov, M., Braslavski, P., Mukhin, M. (2018). Cleaning Up After a Party: Post-processing Thesaurus Crowdsourced Data. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2018. Communications in Computer and Information Science, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-01204-5_13

DOI: https://doi.org/10.1007/978-3-030-01204-5_13
Published: 27 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01203-8
Online ISBN: 978-3-030-01204-5
eBook Packages: Computer Science, Computer Science (R0)
