Cross-lingual embedding for cross-lingual question retrieval in low-resource community question answering

Abstract

In today’s digital world, people look for the knowledge they need by searching the internet for answers to their questions. To this end, many Community Question Answering (CQA) systems have been established, in which people can ask questions and receive the information they need. The data gathered in such systems form a rich repository that people can search for questions that have already been answered. CQA users, however, do not always succeed in finding answers in the CQA systems of their native language. One way to enrich the search is to translate input questions and look them up in other CQA systems. This solution is impractical, because translating each question is time-consuming. To make non-English CQA systems better at finding available answers, such systems can instead use a model that finds similar English questions. To help Persian CQA systems provide answers, we propose a cross-lingual question retrieval model that retrieves English questions relevant to any input Persian question. The proposed model uses translation model-based retrieval with neural cross-lingual word embeddings. Our experiments show that the proposed model achieves 71.4% MRR and 83.5% success@5 using supervised cross-lingual word embeddings.
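The two figures reported above are standard ranking metrics: MRR averages the reciprocal rank of the first relevant question per query, and success@5 is the fraction of queries with at least one relevant question in the top five results. The following is a minimal sketch of how such scores are commonly computed, not the authors' evaluation code; the function names and the toy question IDs are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Sequence[str]], gold: List[Set[str]]) -> float:
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, g) for r, g in zip(runs, gold)) / len(runs)


def success_at_k(runs: List[Sequence[str]], gold: List[Set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(1 for r, g in zip(runs, gold) if any(i in g for i in r[:k]))
    return hits / len(runs)


# Toy data (hypothetical IDs): one ranked list of English questions per Persian query.
runs = [
    ["e3", "e1", "e7"],                    # first relevant at rank 2
    ["e2", "e9"],                          # first relevant at rank 1
    ["e5", "e4", "e8", "e6", "e0", "e1"],  # first relevant at rank 6
]
gold = [{"e1"}, {"e2"}, {"e1"}]

mrr = mean_reciprocal_rank(runs, gold)
s_at_5 = success_at_k(runs, gold, k=5)
```

In this toy run the third query finds its relevant question only at rank 6, so it contributes 1/6 to MRR but counts as a miss for success@5.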

Notes

  1. Entezar is a Persian word that has the same meaning as “expect”; cosine similarity is able to find words that have related or similar meanings.

  2. http://snowball.tartarus.org/algorithms/porter/stemmer.html

Author information

Corresponding author

Correspondence to Saeedeh Momtazi.

About this article

Cite this article

HajiAminShirazi, S., Momtazi, S. Cross-lingual embedding for cross-lingual question retrieval in low-resource community question answering. Machine Translation (2021). https://doi.org/10.1007/s10590-020-09257-7

Keywords

  • Community question answering
  • Cross-lingual embedding
  • Question retrieval
  • Low-resource languages