Fighting with the Sparsity of Synonymy Dictionaries for Automatic Synset Induction

Ustalov, Dmitry; Chernoskutov, Mikhail; Biemann, Chris; Panchenko, Alexander

doi:10.1007/978-3-319-73013-4_9

Conference paper
First Online: 21 December 2017

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10716)

Abstract

Graph-based synset induction methods, such as MaxMax and Watset, induce synsets by performing a global clustering of a synonymy graph. However, such methods are sensitive to the structure of the input synonymy graph: sparseness of the input dictionary can substantially reduce the quality of the extracted synsets. In this paper, we propose two different approaches designed to alleviate the incompleteness of the input dictionaries. The first one performs a pre-processing of the graph by adding missing edges, while the second one performs a post-processing by merging similar synset clusters. We evaluate these approaches on two datasets for the Russian language and discuss their impact on the performance of synset induction methods. Finally, we perform an extensive error analysis of each approach and discuss prominent alternative methods for coping with the problem of sparsity of the synonymy dictionaries.

Notes

1. In the context of this work, we assume that synonymy is a relation of lexical semantic equivalence which is context-independent, as opposed to “contextual synonyms” [32].
2. http://ontopt.dei.uc.pt.
3. In general, the \(m\text{-}k\textit{NN}\) method can be parametrized by two different parameters: \(k_{ij}\), the number of nearest neighbors from the word \(i\) to the word \(j\), and \(k_{ji}\), the number of nearest neighbors from the word \(j\) to the word \(i\). In our case, for simplicity, we set \(k_{ij} = k_{ji} = k\).
4. http://russe.nlpub.ru/downloads.
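
The mutual k-nearest-neighbors condition described in note 3 can be illustrated with a minimal sketch. It assumes the simplified setting \(k_{ij} = k_{ji} = k\) and takes pre-ranked neighbor lists as input; the function name `mutual_knn_edges` and the toy English neighbor lists are hypothetical and are not taken from the paper, which does not specify an implementation on this page.

```python
from typing import Dict, List, Set, Tuple


def mutual_knn_edges(neighbors: Dict[str, List[str]], k: int) -> Set[Tuple[str, str]]:
    """Return undirected edges {i, j} such that j is among the k nearest
    neighbors of i AND i is among the k nearest neighbors of j.

    `neighbors` maps each word to its neighbors ranked by decreasing
    similarity (hypothetical input format, assumed for illustration).
    """
    # Keep only the k best-ranked neighbors of every word.
    top_k = {word: set(ranked[:k]) for word, ranked in neighbors.items()}

    edges: Set[Tuple[str, str]] = set()
    for i, candidates in top_k.items():
        for j in candidates:
            # The relation must hold in both directions; words without
            # their own neighbor list can never satisfy it.
            if i != j and i in top_k.get(j, set()):
                edges.add(tuple(sorted((i, j))))
    return edges


# Toy, made-up neighbor lists (not data from the paper).
neighbors = {
    "car": ["automobile", "vehicle", "truck"],
    "automobile": ["car", "vehicle", "auto"],
    "vehicle": ["truck", "car", "automobile"],
    "truck": ["vehicle", "lorry", "car"],
}
edges = mutual_knn_edges(neighbors, k=2)
```

With these toy lists and k = 2, the pair (automobile, vehicle) is rejected because "automobile" is not among the two nearest neighbors of "vehicle", even though the reverse holds. Supporting distinct \(k_{ij}\) and \(k_{ji}\) would require a per-pair cutoff instead of the single `k` used here.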

References

1. Biemann, C.: Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In: Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pp. 73–80. TextGraphs-1. Association for Computational Linguistics, New York (2006)
2. Braslavski, P., Ustalov, D., Mukhin, M., Kiselev, Y.: YARN: spinning-in-progress. In: Proceedings of the 8th Global WordNet Conference (GWC 2016), pp. 58–65. Global WordNet Association, Bucharest (2016)
3. Van Dongen, S.: Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht (2000)
4. Dorow, B., Widdows, D.: Discovering corpus-specific word senses. In: Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics (EACL 2003), vol. 2, pp. 79–82. Association for Computational Linguistics, Budapest (2003)
5. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998)
6. Feuerbach, T., Riedl, M., Biemann, C.: Distributional semantics for resolving bridging mentions. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, pp. 192–199. INCOMA Ltd., Shoumen, Hissar (2015)
7. Gfeller, D., Chappelier, J.C., De Los Rios, P.: Synonym dictionary improvement through Markov clustering and clustering stability. In: Proceedings of the International Symposium on Applied Stochastic Models and Data Analysis, pp. 106–113, Brest, France (2005)
8. Gonçalo Oliveira, H., Gomes, P.: ECO and Onto.PT: a flexible approach for creating a Portuguese wordnet automatically. Lang. Resour. Eval. 48(2), 373–393 (2014)
9. Gurevych, I., Kim, J. (eds.): The People’s Web Meets NLP: Collaboratively Constructed Language Resources. Theory and Applications of Natural Language Processing. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35085-6
10. Herrmann, D.J.: An old problem for the new psycho-semantics: synonymity. Psychol. Bull. 85(3), 490–512 (1978)
11. Heylen, K., Peirsman, Y., Geeraerts, D., Speelman, D.: Modelling word similarity: an evaluation of automatic synonymy extraction algorithms. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pp. 3243–3249. European Language Resources Association, Marrakech (2008)
12. Hope, D., Keller, B.: MaxMax: a graph-based soft clustering algorithm applied to word sense induction. In: Gelbukh, A. (ed.) CICLing 2013. LNCS, vol. 7816, pp. 368–381. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37247-6_30
13. Lappin, S., Leass, H.J.: An algorithm for pronominal anaphora resolution. Comput. Linguist. 20(4), 535–561 (1994)
14. Loukachevitch, N.V.: Thesauri in Information Retrieval Tasks. Moscow University Press, Moscow (2011). (in Russian)
15. Loukachevitch, N.V., Lashevich, G., Gerasimova, A.A., Ivanov, V.V., Dobrov, B.V.: Creating Russian wordnet by conversion. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference “Dialogue”, pp. 405–415. RSUH, Moscow (2016)
16. Manandhar, S., Klapaftis, I., Dligach, D., Pradhan, S.: SemEval-2010 Task 14: word sense induction & disambiguation. In: Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 63–68. Association for Computational Linguistics, Uppsala (2010)
17. Meyer, C.M., Gurevych, I.: OntoWiktionary: Constructing an Ontology from the Collaborative Online Dictionary Wiktionary, pp. 131–161. IGI Global, Hershey (2012)
18. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, vol. 26, pp. 3111–3119. Curran Associates Inc., Harrahs and Harveys (2013)
19. Navigli, R.: A quick tour of word sense disambiguation, induction and related approaches. In: Bieliková, M., Friedrich, G., Gottlob, G., Katzenbeisser, S., Turán, G. (eds.) SOFSEM 2012. LNCS, vol. 7147, pp. 115–129. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27660-6_10
20. Panchenko, A.: Comparison of the baseline knowledge-, corpus-, and web-based similarity measures for semantic relations extraction. In: Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics (GEMS 2011), pp. 11–21. Association for Computational Linguistics, Edinburgh (2011)
21. Panchenko, A., Adeykin, S., Romanov, A., Romanov, P.: Extraction of semantic relations between concepts with KNN algorithms on Wikipedia. In: Proceedings of the 2nd International Workshop on Concept Discovery in Unstructured Data, pp. 78–86, no. 871 in CEUR Workshop Proceedings, Leuven, Belgium (2012)
22. Panchenko, A., Morozova, O., Naets, H.: A semantic similarity measure based on lexico-syntactic patterns. In: Proceedings of KONVENS 2012, pp. 174–178, ÖGAI (2012)
23. Panchenko, A., Simon, J., Riedl, M., Biemann, C.: Noun sense induction and disambiguation using graph-based distributional semantics. In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), pp. 192–202. Bochumer Linguistische Arbeitsberichte (2016)
24. Panchenko, A., Ustalov, D., Arefyev, N., Paperno, D., Konstantinova, N., Loukachevitch, N., Biemann, C.: Human and machine judgements for Russian semantic relatedness. In: Ignatov, D., et al. (eds.) AIST 2016. CCIS, vol. 661, pp. 221–235. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-52920-2_21
25. Peirsman, Y., Heylen, K., Speelman, D.: Putting things in order. First and second order context models for the calculation of semantic similarity. In: Proceedings of the 9th Journées internationales d’Analyse statistique des Données Textuelles (JADT 2008), pp. 907–916, Lyon, France (2008)
26. Pelevina, M., Arefyev, N., Biemann, C., Panchenko, A.: Making sense of word embeddings. In: Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 174–183. Association for Computational Linguistics, Berlin (2016)
27. Seitner, J., Bizer, C., Eckert, K., Faralli, S., Meusel, R., Paulheim, H., Ponzetto, S.P.: A large database of hypernymy relations extracted from the web. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 360–367. European Language Resources Association (ELRA), Portorož (2016)
28. Shwartz, V., Goldberg, Y., Dagan, I.: Improving hypernymy detection with an integrated path-based and distributional method. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 2389–2398. Association for Computational Linguistics, Berlin (2016)
29. Snow, R., Jurafsky, D., Ng, A.Y.: Learning syntactic patterns for automatic hypernym discovery. In: Proceedings of the 17th International Conference on Neural Information Processing Systems (NIPS 2004), pp. 1297–1304. MIT Press, Vancouver (2004)
30. Ustalov, D., Panchenko, A., Biemann, C.: Watset: automatic induction of synsets from a graph of synonyms. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1579–1590. Association for Computational Linguistics, Vancouver (2017)
31. Wandmacher, T.: How semantic is latent semantic analysis? In: Proceedings of RÉCITAL 2005, pp. 525–534, Dourdan, France (2005)
32. Zeng, X.M.: Semantic relationships between contextual synonyms. US-China Educ. Rev. 4(9), 33–37 (2007)
33. Zipf, G.K.: The Psycho-Biology of Language. Houghton, Mifflin, Oxford, England (1935)

Acknowledgements

We acknowledge the support of the Deutsche Forschungsgemeinschaft (DFG) under the “JOIN-T” project, the DAAD, the RFBR under the projects no. 16-37-00203 and no. 16-37-00354 and the RFH under the project no. 16-04-12019. The research was supported by the Ministry of Education and Science of the Russian Federation Agreement no. 02.A03.21.0006. The calculations were carried out using the supercomputer “Uran” at the Krasovskii Institute of Mathematics and Mechanics. Finally, we also thank four anonymous reviewers for their helpful comments.

Author information

Authors and Affiliations

Ural Federal University, Yekaterinburg, Russia
Dmitry Ustalov & Mikhail Chernoskutov
Krasovskii Institute of Mathematics and Mechanics, Yekaterinburg, Russia
Dmitry Ustalov & Mikhail Chernoskutov
Universität Hamburg, Hamburg, Germany
Chris Biemann & Alexander Panchenko

Corresponding author

Correspondence to Dmitry Ustalov.

Editor information

Editors and Affiliations

Eindhoven University of Technology, Eindhoven, The Netherlands
Wil M.P. van der Aalst
National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov
Krasovsky Institute of Mathematics and Mechanics, Ekaterinburg, Russia
Michael Khachay
National Research University Higher School of Economics, Moscow, Russia
Sergei O. Kuznetsov
Skolkovo Institute of Science and Technology, Moscow, Russia
Victor Lempitsky
National Research University Higher School of Economics, Moscow, Russia
Irina A. Lomazova
Moscow State University, Moscow, Russia
Natalia Loukachevitch
LORIA, Campus Scientifique, Vandœuvre lès Nancy, France
Amedeo Napoli
University of Hamburg, Hamburg, Germany
Alexander Panchenko
University of Florida, Gainesville, Florida, USA
Panos M. Pardalos
National Research University Higher School of Economics, Nizhny Novgorod, Russia
Andrey V. Savchenko
Indiana University, Bloomington, Indiana, USA
Stanley Wasserman

Cite this paper

Ustalov, D., Chernoskutov, M., Biemann, C., Panchenko, A. (2018). Fighting with the Sparsity of Synonymy Dictionaries for Automatic Synset Induction. In: van der Aalst, W., et al. Analysis of Images, Social Networks and Texts. AIST 2017. Lecture Notes in Computer Science, vol 10716. Springer, Cham. https://doi.org/10.1007/978-3-319-73013-4_9

DOI: https://doi.org/10.1007/978-3-319-73013-4_9
Published: 21 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73012-7
Online ISBN: 978-3-319-73013-4
eBook Packages: Computer Science, Computer Science (R0)
