Cleaning Up After a Party: Post-processing Thesaurus Crowdsourced Data

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 930)

Abstract

This study addresses the post-processing of a noisy collection of synsets created through crowdsourcing. First, we cluster long synsets in three different ways. Second, we apply four cluster-cleaning techniques based on either word popularity or word embeddings. Evaluation shows that the method based on word embeddings and existing dictionary definitions delivers the best results.
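
As a rough illustration of what embedding-based cluster cleaning can look like, the sketch below drops a word from a cluster when its vector is far from the mean vector of the other members. This is not the procedure evaluated in the paper: the function names `cosine` and `clean_cluster`, the threshold of 0.5, and the toy English vectors are all assumptions made for the example. In practice the vectors would come from a pretrained model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clean_cluster(words, vectors, threshold=0.5):
    """Keep words whose vector is close to the mean vector of the other members.

    `threshold` is a hypothetical cut-off chosen for this toy example.
    Words without a vector are skipped.
    """
    known = [w for w in words if w in vectors]
    kept = []
    for w in known:
        others = [vectors[o] for o in known if o != w]
        if not others:
            # A single-word cluster has nothing to be compared against.
            kept.append(w)
            continue
        # Component-wise mean of the other members' vectors.
        centroid = [sum(col) / len(others) for col in zip(*others)]
        if cosine(vectors[w], centroid) >= threshold:
            kept.append(w)
    return kept


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.8, 0.2, 0.1],
    "huge": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.1, 0.9],
}
cleaned = clean_cluster(["big", "large", "huge", "banana"], toy_vectors)
print(cleaned)
```

With these toy vectors the three size adjectives stay together and the unrelated word is removed. Comparing each word against the mean of the others, rather than the mean of the whole cluster, keeps an outlier from pulling the reference point toward itself.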

O. Antropova, E. Arslanova and M. Shaposhnikov contributed equally to the paper.

Notes

  1. https://russianword.net/editor.

  2. https://russianword.net/data.

  3. Babenko, L.G.: The thesaurus dictionary of the Russian language synonyms, 2008.

  4. https://ru.wiktionary.org/.

  5. An overview of dictionary data available for Russian can be found in [8].

  6. http://rusvectores.org/ru/models/.

  7. https://radimrehurek.com/gensim/.

  8. We use word2vec as the name of a general approach to word embeddings and to contrast it with the latter method, which works with definitions.

References

  1. Biemann, C.: Creating a system for lexical substitutions from scratch using crowdsourcing. Lang. Resour. Eval. 47(1), 97–122 (2013)

  2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)

  3. Braslavski, P., Ustalov, D., Mukhin, M., Kiselev, Y.: YARN: spinning-in-progress. In: GWC, pp. 58–65 (2016)

  4. Braslavski, P., Ustalov, D., Mukhin, M.: A spinning wheel for YARN: user interface for a crowdsourced thesaurus. In: EACL (demo), pp. 101–104 (2014)

  5. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)

  6. Gurevych, I., Kim, J. (eds.): The People’s Web Meets NLP. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35085-6

  7. Kiselev, Y., Ustalov, D., Porshnev, S.: Eliminating fuzzy duplicates in crowdsourced lexical resources. In: GWC, pp. 161–167 (2016)

  8. Kiselev, Y., et al.: Russian lexicographic landscape: a tale of 12 dictionaries. In: Dialogue, pp. 254–271 (2015)

  9. Kutuzov, A., Kuzmenko, E.: Webvectors: a toolkit for building web interfaces for vector semantic models. In: AIST, pp. 155–161 (2017)

  10. Ustalov, D., Panchenko, A., Biemann, C.: Watset: automatic induction of synsets from a graph of synonyms. In: ACL, pp. 1579–1590 (2017)

Acknowledgments

PB was supported by RFH grant #16-04-12019; OA was supported by RFBR under research project No. 18-312-00129.

Author information

Corresponding author

Correspondence to Oksana Antropova.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Antropova, O., Arslanova, E., Shaposhnikov, M., Braslavski, P., Mukhin, M. (2018). Cleaning Up After a Party: Post-processing Thesaurus Crowdsourced Data. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2018. Communications in Computer and Information Science, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-01204-5_13

  • DOI: https://doi.org/10.1007/978-3-030-01204-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01203-8

  • Online ISBN: 978-3-030-01204-5

  • eBook Packages: Computer Science, Computer Science (R0)
