Abstract
Word embedding has been a great success story for natural language processing in recent years. Its main purpose is to provide vector representations of words based on neural network language modeling. Using a large training corpus, a model such as Skip-gram learns from the co-occurrences of words and captures their semantic features. Moreover, by adding the recently introduced character embedding model to the objective function, the model can also capture the morphological features of words. In this paper, we study the impact of the training corpus on the resulting word embeddings and show how the genre of the training data affects the type of information captured by word embedding models. We perform our experiments on the Persian language. As part of the contribution of this paper, we also provide two well-known evaluation datasets for Persian, namely the Google semantic/syntactic analogy dataset and Wordsim353. The experiments include computing word embeddings from various public Persian corpora of different genres and sizes, together with a comprehensive lexical and semantic comparison between them. We identify words whose usage differs between these corpora, resulting in entirely different vector representations; this has a significant impact across domains, with results varying by up to 9% on the Google analogy task and up to 6% on Wordsim353. The resulting word embeddings for each individual corpus, as well as for their combinations, will be made publicly available for any further research based on word embedding for Persian.
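The Google analogy evaluation mentioned in the abstract scores embeddings by solving questions of the form "a is to b as c is to ?" through vector arithmetic (the 3CosAdd method). The following is a minimal, self-contained sketch of that scoring procedure; the toy two-dimensional embeddings and word list are hypothetical, hand-crafted for illustration, and stand in for the real Persian embeddings trained in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(emb, a, b, c):
    """Solve a : b :: c : ? via 3CosAdd, i.e. argmax_d cos(d, b - a + c),
    excluding the three query words themselves."""
    target = [bi - ai + ci for ai, bi, ci in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy embeddings (hypothetical): one axis loosely encodes "royalty",
# the other loosely encodes "gender".
emb = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [0.5, 0.3],
}

print(analogy(emb, "man", "woman", "king"))  # → "queen"
```

An analogy benchmark like the Google dataset simply runs this query over thousands of such quadruples and reports the fraction answered correctly, which is the accuracy figure the abstract refers to.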
Notes
We will make our scripts and models available upon publication.
References
Al-Rfou, R., Perozzi, B., & Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL (pp. 183–192).
AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M., & Oroumchian, F. (2009). Hamshahri: A standard Persian text collection. Knowledge-Based Systems, 22(5), 382–387.
AleAhmad, A., Zahedi, M. S., Rahgozar, M., & Moshiri, B. (2016). IrBlogs: A standard collection for studying Persian bloggers. Computers in Human Behavior, 57, 195–207.
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL (pp. 238–247).
Basirat, A., & Nivre, J. (2016). Greedy universal dependency parsing with right singular word vectors. In Proceedings of SLTC.
Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Brokos, G., Malakasiotis, P., & Androutsopoulos, I. (2016). Using centroids of word embeddings and word mover’s distance for biomedical document retrieval in question answering. In Proceedings of BioNLP (pp. 114–118).
Camacho-Collados, J., et al. (2017). SemEval-2017 Task 2: Multilingual and cross-lingual semantic word similarity. In Proceedings of SemEval-2017.
Cha, M., Gwon, Y., & Kung, H. T. (2017). Language modeling by clustering with word embeddings for text readability assessment. In Proceedings of CIKM (pp. 2003–2006).
Chen, X., Liu, Z., & Sun, M. (2014). A unified model for word sense representation and disambiguation. In Proceedings of EMNLP (pp. 1025–1035).
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. In Proceedings of ICML (pp. 160–167).
dos Santos, C. N., & Zadrozny, B. (2014). Learning character-level representations for part-of-speech tagging. In Proceedings of ICML (pp. 1818–1826).
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., et al. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20, 116–131.
Gharavi, E., Bijari, K., Zahirnia, K., & Veisi, H. (2016). A deep learning approach to Persian plagiarism detection. In Proceedings of FIRE (pp. 154–159).
Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2016). Embeddings for word sense disambiguation: An evaluation study. In Proceedings of ACL (pp. 897–907).
Kenter, T., & de Rijke, M. (2015). Short text similarity with word embeddings. In Proceedings of CIKM (pp. 1411–1420).
Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger K. Q., et al. (2015). From word embeddings to document distances. In Proceedings of ICML (pp. 957–966).
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of NAACL-HLT (pp. 260–270).
Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225.
Lin, C.-C., Ammar, W., Dyer, C., & Levin, L. (2015). Unsupervised POS induction with word embeddings. In Proceedings of NAACL-HLT (pp. 1311–1316).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS (pp. 1–9).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in vector space (pp. 1–12). arXiv:1301.3781v3.
Mnih, A., & Hinton, G. E. (2008). A scalable hierarchical distributed language model. In Proceedings of NIPS (pp. 1–8).
Passban, P., Qun, L., & Way, A. (2016). Boosting neural POS tagger for Farsi using morphological information. ACM Transactions on Asian and Low-Resource Language Information Processing, 16, 1–15.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP (pp. 1532–1543).
Rehurek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of LREC (pp. 45–50).
Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL (pp. 384–394).
Zamani, H., & Croft, W. B. (2016). Embedding-based query language models. In Proceedings of ICTIR (pp. 147–156).
Zhang, Y., Gaddy, D., Barzilay, R., & Jaakkola, T. S. (2016). Ten pairs to tag: Multilingual POS tagging via coarse mapping between embeddings. In Proceedings of NAACL-HLT (pp. 1307–1317).
Zuccon, G., Koopman, B., Bruza, P., & Azzopardi, L. (2015). Integrating and evaluating neural word embeddings in information retrieval. In Proceedings of ADCS (pp. 1–8).
Acknowledgements
The Twitter dataset was provided by Ali Shariat Bahadori from the University of Tehran. Any usage and statements made herein are solely the responsibility of the authors.
Cite this article
Hadifar, A., Momtazi, S. The impact of corpus domain on word representation: a study on Persian word embeddings. Lang Resources & Evaluation 52, 997–1019 (2018). https://doi.org/10.1007/s10579-018-9419-x