Efficient question classification and retrieval using category information and word embedding on cQA services

  • Kyoungman Bae
  • Youngjoong Ko


Question classification, the task of automatically assigning unlabeled questions to predefined categories (or topics), and the effective retrieval of similar questions are crucial aspects of an effective cQA service. For question classification, we first address the problems associated with estimating and utilizing the distribution of words in each category when computing word weights. We then apply an automatic expansion word generation technique, based on our proposed weighting method and pseudo-relevance feedback, to question classification. Second, to address the lexical gap problem in question retrieval, we define the case frame of a sentence using components extracted from the sentence, and derive a similarity measure based on the case frame and word embeddings to determine the similarity between two sentences. These similarities are then used to reorder the results of the first retrieval model. As a result, the proposed methods significantly improve the performance of question classification and retrieval.
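The reordering step can be illustrated with a minimal sketch. It is not the authors' method: the case-frame component is omitted, sentence similarity is approximated by the cosine of averaged word vectors, and the toy embeddings, the function names and the interpolation weight `alpha` are all assumptions made for illustration.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
TOY_EMBEDDINGS = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "battery": [0.1, 0.9, 0.0],
    "charge": [0.2, 0.8, 0.1],
    "recipe": [0.0, 0.1, 0.9],
    "pasta": [0.1, 0.0, 0.9],
}


def sentence_vector(tokens, embeddings):
    """Average the vectors of known tokens; return None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(query_tokens, candidates, embeddings, alpha=0.5):
    """Reorder first-pass results by a mix of first-pass score and similarity.

    candidates: list of (tokens, first_pass_score) pairs.
    alpha: hypothetical weight on the first-pass score (assumption).
    """
    q_vec = sentence_vector(query_tokens, embeddings)
    rescored = []
    for tokens, base_score in candidates:
        c_vec = sentence_vector(tokens, embeddings)
        if q_vec is None or c_vec is None:
            sim = 0.0
        else:
            sim = cosine(q_vec, c_vec)
        rescored.append((tokens, alpha * base_score + (1 - alpha) * sim))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# The first retrieval model ranked the off-topic question higher.
first_pass = [(["recipe", "pasta"], 0.6), (["notebook", "charge"], 0.5)]
ranked = rerank(["laptop", "battery"], first_pass, TOY_EMBEDDINGS)
```

In this toy run the query shares no words with either candidate, so a purely lexical score cannot separate them; the embedding similarity moves the semantically closer question to the top, which is the kind of lexical gap the abstract refers to.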


Question classification; Word weighting method; Category information; Pseudo-relevance feedback; Question expansion



This work was supported by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2013-2-00131, Development of Knowledge Evolutionary WiseQA Platform Technology for Human Knowledge Augmented Services).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Language Intelligence Research Group, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea
  2. Department of Computer Engineering, Dong-A University 840, Busan, Republic of Korea
