A Large-Scale Community Questions Classification Accounting for Category Similarity: An Exploratory Study
The paper reports on a large-scale topical categorization of questions from the Russian community question answering (CQA) service Otvety@Mail.Ru. We used a data set containing all the questions (more than 11 million) asked by Otvety@Mail.Ru users in 2012. This is the first study on question categorization dealing with non-English data of this size. The study focuses on adjusting the category structure to obtain more robust classification results. We investigate several approaches to measuring similarity between categories: the share of identical questions, language models, and user activity. The results show that the proposed approach is promising.
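As an illustration of the first similarity measure named above, the share of identical questions between two categories can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the study: the function names `normalize` and `identical_question_share` are hypothetical, "identical" is taken to mean equal after lowercasing and whitespace collapsing, and the share is normalized by the size of the smaller category.

```python
from collections import defaultdict
from itertools import combinations


def normalize(text):
    # Assumption: questions count as identical if they match after
    # lowercasing and collapsing runs of whitespace.
    return " ".join(text.lower().split())


def identical_question_share(questions):
    """Return a dict mapping each category pair to its overlap share.

    `questions` is an iterable of (category, question_text) pairs.
    The share is |A & B| / min(|A|, |B|), which is an illustrative
    normalization and not necessarily the one used in the paper.
    """
    by_category = defaultdict(set)
    for category, text in questions:
        by_category[category].add(normalize(text))

    shares = {}
    for a, b in combinations(sorted(by_category), 2):
        common = by_category[a] & by_category[b]
        smaller = min(len(by_category[a]), len(by_category[b]))
        shares[(a, b)] = len(common) / smaller
    return shares


# Toy data; category names and questions are made up for illustration.
sample = [
    ("Pets", "How to feed a cat?"),
    ("Pets", "how to feed a  cat?"),
    ("Animals", "How to feed a cat?"),
    ("Animals", "Why do birds sing?"),
    ("Cars", "How to change oil?"),
]
shares = identical_question_share(sample)
```

Pairs with a high share are candidates for merging or for being treated as a single class during training. With millions of questions, exact matching could be replaced by an approximate near-duplicate technique, which this sketch does not attempt.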
Keywords: Question topic categorization, Community question answering, Question retrieval, Large-scale classification
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.