International Journal of Speech Technology, Volume 21, Issue 1, pp 121–136

Improving Arabic information retrieval using word embedding similarities

  • Abdelkader El Mahdaouy
  • Saïd Ouatik El Alaoui
  • Eric Gaussier

Abstract

Term mismatch is a common limitation of traditional information retrieval (IR) models, in which relevance scores are estimated from exact matches between document and query terms. A good IR model should also take distinct but semantically similar words into account in the matching process. In this paper, we propose a method to incorporate word embedding (WE) semantic similarities into existing probabilistic IR models for Arabic in order to deal with term mismatch. Experiments are performed on the standard Arabic TREC collection using three neural word embedding models. The results show that the proposed extensions significantly outperform their baseline bag-of-words models, while the differences among the evaluated neural word embedding models are not statistically significant. Moreover, the overall comparison shows that our extensions significantly outperform the Arabic WordNet-based semantic indexing approach and three recent WE-based IR language models.
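
To make the idea concrete, the following is a minimal sketch of how embedding similarities could soften exact term matching. It is an illustration under stated assumptions, not the formulation used in the paper, which this page does not give. The function names (`cosine`, `soft_tf`), the similarity threshold, and the toy two-dimensional vectors are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_tf(query_term, doc_tokens, embeddings, threshold=0.7):
    """Term frequency that also credits semantically similar document terms.

    Exact matches count fully. Any other document term contributes its count
    weighted by its cosine similarity to the query term, provided the
    similarity reaches `threshold` (an assumed cut-off, not from the paper).
    """
    total = 0.0
    for term, count in Counter(doc_tokens).items():
        if term == query_term:
            total += count
        elif query_term in embeddings and term in embeddings:
            sim = cosine(embeddings[query_term], embeddings[term])
            if sim >= threshold:
                total += sim * count
    return total


# Toy vectors chosen by hand for illustration; real embeddings would be
# learned from a corpus by a neural word embedding model.
EMBEDDINGS = {
    "car": (1.0, 0.0),
    "automobile": (0.8, 0.6),  # cosine similarity to "car" is about 0.8
    "banana": (0.0, 1.0),      # orthogonal to "car"
}

doc = ["automobile", "automobile", "banana"]
exact = doc.count("car")                 # exact matching finds no occurrence
soft = soft_tf("car", doc, EMBEDDINGS)   # similar terms contribute partially
```

In this toy case the document never contains the query term, so exact matching gives a term frequency of zero, while the softened count is positive because "automobile" is close to "car" in the embedding space. A softened frequency of this kind could then be substituted for the raw term frequency inside a probabilistic weighting function.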

Keywords

Arabic information retrieval, Term mismatch, Word embedding, Semantic similarity

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Abdelkader El Mahdaouy (1, 2)
  • Saïd Ouatik El Alaoui (1)
  • Eric Gaussier (2)
  1. Laboratory of Informatics and Modeling, Faculty of Sciences Dhar el Mahraz, Sidi Mohamed Ben Abdellah University, Fez, Morocco
  2. Université Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France