Domain-specific cross-language relevant question retrieval

Xu, Bowen; Xing, Zhenchang; Xia, Xin; Lo, David; Li, Shanping

doi:10.1007/s10664-017-9568-3

Published: 04 November 2017

Volume 23, pages 1084–1122, (2018)

Empirical Software Engineering

Bowen Xu¹,
Zhenchang Xing²,
Xin Xia³,⁴ (ORCID: orcid.org/0000-0002-6302-3256),
David Lo⁵ &
Shanping Li¹

Abstract

Chinese developers often cannot search effectively for questions in English, because they may have difficulty translating technical words from Chinese to English and formulating proper English queries. To help Chinese developers take advantage of the rich knowledge base of Stack Overflow and to simplify the question retrieval process, we propose an automated cross-language relevant question retrieval (CLRQR) system that retrieves relevant English questions for a given Chinese question. CLRQR first extracts essential information (both Chinese and English) from the title and description of the input Chinese question, then performs domain-specific translation of the essential Chinese information into English, and finally formulates an English query for retrieving relevant questions from a repository of English Stack Overflow questions. We propose three retrieval algorithms (word-embedding, word-matching, and vector-space-model based methods) that exploit different document representations and similarity metrics for question retrieval. To evaluate the performance of our approach and investigate the effectiveness of the different retrieval algorithms, we propose four baseline approaches based on combinations of different sources of query words, query formulation mechanisms, and search engines. We randomly select 80 Java, 20 Python, and 20 .NET questions from SegmentFault and V2EX (two Chinese Q&A websites for computer programming) as the query Chinese questions. We conduct a user study to evaluate the relevance of the English questions retrieved by CLRQR with the different retrieval algorithms and by the four baseline approaches. The experimental results show that CLRQR with word-embedding based retrieval achieves the best performance.
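As a rough illustration of what a vector-space-model based retrieval step could look like, the Python sketch below ranks a toy repository of English questions against an already-translated English query using TF-IDF weights and cosine similarity. The abstract does not specify the weighting scheme, preprocessing, or data used by CLRQR, so the tokenizer, the smoothed IDF formula, the function names, and the three sample questions are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization. A real system would
    # likely also remove stop words and apply stemming.
    return text.lower().split()


def tfidf_vectors(docs):
    """Build one {term: weight} vector per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: log-scaled IDF with +1 so shared terms keep some weight.
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, questions):
    """Return (score, question) pairs sorted by descending similarity."""
    docs = [tokenize(q) for q in questions]
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the repository are ignored.
    q_vec = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q_vec, v), q) for v, q in zip(vectors, questions)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy repository standing in for Stack Overflow questions.
QUESTIONS = [
    "how to convert string to int in java",
    "difference between method and function in java",
    "python list comprehension with condition",
]

# Hypothetical English query, as it might look after translation.
ranked = rank("java string to int conversion", QUESTIONS)
top_question = ranked[0][1]
print(top_question)
```

Here the first toy question shares the most weighted terms with the query and is ranked first, while the Python question shares none and scores zero. A word-embedding based variant would replace the sparse TF-IDF vectors with dense vectors learned from a corpus, so that related but non-identical words can still contribute to the similarity.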

Notes

Stack Overflow 2016 Developer Survey, https://stackoverflow.com/research/developer-survey-2016
Planet Earth Has 18.5 Million Developers, http://www.drdobbs.com/tools/planet-earth-has-185-million-developers/240165016
FudanNLP, available at http://nlp.fudan.edu.cn
IctclasNLP, available at http://ictclas.nlpir.org/docs
Youdao translation API, available at http://fanyi.youdao.com/openapi
Java’s methods versus functions, available at http://stackoverflow.com/questions/16223531/javas-methods-vs-functions
Web translation result, http://faq.youdao.com/dict/?p=65
Stop-word list, available at http://snowball.tartarus.org/algorithms/english/stop.txt
The 120 Query Chinese questions, available at https://goo.gl/zAbLVp
Stack Exchange Data Dump, available at https://archive.org/download/stackexchange
A Chinese question on V2EX, available at https://www.v2ex.com/t/47663
A Chinese question on SegmentFault, available at https://SegmentFault.com/q/1010000003408795
Fig. 10: A Chinese question on SegmentFault
A Chinese question on V2EX, available at https://www.v2ex.com/t/137913

Acknowledgment

This work was partially supported by the NSFC Program (Nos. 61602403 and 61572426).

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Bowen Xu & Shanping Li
Research School of Computer Science, Australian National University, Canberra, Australia
Zhenchang Xing
Faculty of Information Technology, Monash University, Melbourne, Australia
Xin Xia
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Xin Xia
School of Information Systems, Singapore Management University, Singapore, Singapore
David Lo

Corresponding author

Correspondence to Xin Xia.

Additional information

Communicated by: Romain Robbes, Christian Bird and Emily Hill

About this article

Cite this article

Xu, B., Xing, Z., Xia, X. et al. Domain-specific cross-language relevant question retrieval. Empir Software Eng 23, 1084–1122 (2018). https://doi.org/10.1007/s10664-017-9568-3

Published: 04 November 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s10664-017-9568-3
