Abstract
Word embedding has been a great success story for natural language processing in recent years. Its main purpose is to provide vector representations of words based on neural network language modeling. Using a large training corpus, a model such as Skip-gram learns from the co-occurrences of words and captures their semantic features. Moreover, by adding the recently introduced character embedding model to the objective function, the model can also capture the morphological features of words. In this paper, we study the impact of the training corpus on the resulting word embeddings and show how the genre of the training data affects the type of information captured by word embedding models. We perform our experiments on the Persian language. As part of the contribution of this paper, we also provide two well-known evaluation datasets for Persian, namely the Google semantic/syntactic analogy dataset and Wordsim353. The experiments include computing word embeddings from various public Persian corpora of different genres and sizes, together with a comprehensive lexical and semantic comparison between them. We identify words whose usage differs between these corpora, resulting in entirely different vector representations; this has a significant impact across domains, with results varying by up to 9% on the Google analogy task and up to 6% on Wordsim353. The resulting word embeddings for each individual corpus, as well as for their combinations, will be made publicly available for any further research based on word embedding for Persian.
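The Google analogy evaluation mentioned in the abstract scores embeddings by solving questions of the form "a is to b as c is to ?" through vector arithmetic (the 3CosAdd method). The following is a minimal, self-contained sketch of that scoring procedure; the toy two-dimensional embeddings and word list are hypothetical, hand-crafted for illustration, and stand in for the real Persian embeddings trained in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(emb, a, b, c):
    """Solve a : b :: c : ? via 3CosAdd, i.e. argmax_d cos(d, b - a + c),
    excluding the three query words themselves."""
    target = [bi - ai + ci for ai, bi, ci in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy embeddings (hypothetical): one axis loosely encodes "royalty",
# the other loosely encodes "gender".
emb = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [0.5, 0.3],
}

print(analogy(emb, "man", "woman", "king"))  # → "queen"
```

An analogy benchmark like the Google dataset simply runs this query over thousands of such quadruples and reports the fraction answered correctly, which is the accuracy figure the abstract refers to.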
Notes
We will make our scripts and models available upon publication.
References
Al-Rfou, R., Perozzi, B., & Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL (pp. 183–192).
AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M., & Oroumchian, F. (2009). Hamshahri: A standard Persian text collection. Knowledge-Based Systems, 22(5), 382–387.
AleAhmad, A., Zahedi, M. S., Rahgozar, M., & Moshiri, B. (2016). IrBlogs: A standard collection for studying Persian bloggers. Computers in Human Behavior, 57, 195–207.
Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL (pp. 238–247).
Basirat, A., & Nivre, J. (2016). Greedy universal dependency parsing with right singular word vectors. In Proceedings of SLTC.
Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Brokos, G., Malakasiotis, P., & Androutsopoulos, I. (2016). Using centroids of word embeddings and word mover’s distance for biomedical document retrieval in question answering. In Proceedings of BioNLP (pp. 114–118).
Camacho-Collados, J., et al. (2017). SemEval-2017 Task 2: Multilingual and cross-lingual semantic word similarity. In Proceedings of SemEval-2017.
Cha, M., Gwon, Y., & Kung, H. T. (2017). Language modeling by clustering with word embeddings for text readability assessment. In Proceedings of CIKM (pp. 2003–2006).
Chen, X., Liu, Z., & Sun, M. (2014). A unified model for word sense representation and disambiguation. In Proceedings of EMNLP (pp. 1025–1035).
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. In Proceedings of ICML (pp. 160–167).
dos Santos, C. N., & Zadrozny, B. (2014). Learning character-level representations for part-of-speech tagging. In Proceedings of ICML (pp. 1818–1826).
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., et al. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20, 116–131.
Gharavi, E., Bijari, K., Zahirnia, K., & Veisi, H. (2016). A deep learning approach to Persian plagiarism detection. In Proceedings of FIRE (pp. 154–159).
Iacobacci, I., Pilehvar, M. T., & Navigli, R. (2016). Embeddings for word sense disambiguation: An evaluation study. In Proceedings of ACL (pp. 897–907).
Kenter, T., & de Rijke, M. (2015). Short text similarity with word embeddings. In Proceedings of CIKM (pp. 1411–1420).
Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger K. Q., et al. (2015). From word embeddings to document distances. In Proceedings of ICML (pp. 957–966).
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of NAACL-HLT (pp. 260–270).
Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225.
Lin, C.-C., Ammar, W., Dyer, C., & Levin, L. (2015). Unsupervised POS induction with word embeddings. In Proceedings of NAACL-HLT (pp. 1311–1316).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS (pp. 1–9).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in vector space (pp. 1–12). arXiv:1301.3781v3.
Mnih, A., & Hinton, G. E. (2008). A scalable hierarchical distributed language model. In Proceedings of NIPS (pp. 1–8).
Passban, P., Qun, L., & Way, A. (2016). Boosting neural POS tagger for Farsi using morphological information. ACM Transactions on Asian and Low-Resource Language Information Processing, 16, 1–15.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of EMNLP (pp. 1532–1543).
Rehurek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. In Proceedings of LREC (pp. 45–50).
Turian, J., Ratinov, L., & Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL (pp. 384–394).
Zamani, H., & Croft, W. B. (2016). Embedding-based query language models. In Proceedings of ICTIR (pp. 147–156).
Zhang, Y., Gaddy, D., Barzilay, R., & Jaakkola, T. S. (2016). Ten pairs to tag: Multilingual POS tagging via coarse mapping between embeddings. In Proceedings of NAACL-HLT (pp. 1307–1317).
Zuccon, G., Koopman, B., Bruza, P., & Azzopardi, L. (2015). Integrating and evaluating neural word embeddings in information retrieval. In Proceedings of ADCS (pp. 1–8).
Acknowledgements
The Twitter dataset was provided by Ali Shariat Bahadori from the University of Tehran. Any usage and statements made herein are solely the responsibility of the authors.
Cite this article
Hadifar, A., Momtazi, S. The impact of corpus domain on word representation: a study on Persian word embeddings. Lang Resources & Evaluation 52, 997–1019 (2018). https://doi.org/10.1007/s10579-018-9419-x