Abstract
WordNet is one of the most widely used resources in Natural Language Processing (NLP). However, the only WordNet available for Spanish mainly represents the Spanish spoken in Spain, and it is approximately 50% the size of Princeton’s English WordNet. To address these issues, we automatically translate the Princeton version using lemmas and sentences from all the available corpora annotated with WordNet senses (LAS-WordNet). In addition, we enrich the translated version with lexicons of Pan-Hispanic regionalisms extracted from Twitter (LAR-WordNet). The proposed resources were evaluated on the task of Semantic Textual Similarity, both in Spanish and cross-lingually between Spanish and English. The results show that LAS-WordNet significantly outperforms the current Spanish WordNet and that the regionalisms added in LAR-WordNet do not hinder its performance. Although the proposed resources are noisier than the current Spanish WordNet, their size and representativeness make them suitable for many NLP applications.
Supported by the Asociación de Amigos del Instituto Caro y Cuervo.
Notes
- 4. LAS-WordNet and LAR-WordNet are available at https://www.datos.gov.co/browse?q=wordnet.
- 6. STS Benchmark: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Jimenez, S., Dueñas, G. (2018). LAR-WordNet: A Machine-Translated, Pan-Hispanic and Regional WordNet for Spanish. In: Simari, G., Fermé, E., Gutiérrez Segura, F., Rodríguez Melquiades, J. (eds) Advances in Artificial Intelligence - IBERAMIA 2018. IBERAMIA 2018. Lecture Notes in Computer Science, vol 11238. Springer, Cham. https://doi.org/10.1007/978-3-030-03928-8_32
DOI: https://doi.org/10.1007/978-3-030-03928-8_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03927-1
Online ISBN: 978-3-030-03928-8
eBook Packages: Computer Science, Computer Science (R0)