
Exploring the Feature Selection-Based Data Analytics Solutions for Text Mining Online Communities by Investigating the Influential Factors: A Case Study of Programming CQA in Stack Overflow

  • Chapter
  • First Online:
Big Data Applications and Use Cases

Abstract

Community question answering (CQA) services accumulate a large amount of knowledge through the voluntary contributions of community members across the globe. CQA services have recently gained much popularity over other Internet services as a means of obtaining and exchanging information. Stack Overflow is an example of such a service that targets programmers and software developers. Most questions on Stack Overflow eventually end with an answer accepted by the asker; however, the number of unanswered or ignored questions has increased significantly in the past few years. Understanding the factors that lead to questions being answered, as well as those that leave questions ignored, can help information seekers improve the quality of their questions and increase their chances of getting answers from the Stack Overflow community. In this study, we apply data mining techniques to identify the relevant features for predicting question quality, and we validate the reliability of these features using several state-of-the-art classification algorithms. The selected features should be significant in the sense that they can help Stack Overflow improve its existing CQA service in terms of user satisfaction with the answers obtained.


Notes

  1. http://www.answerbag.com/
  2. http://stackoverflow.com/
  3. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
  4. http://www.experts-exchange.com/
  5. https://data.stackexchange.com/
  6. http://www.nltk.org/
  7. http://scikit-learn.org/



Author information


Corresponding author

Correspondence to Shu Zhou.


Appendices

Appendix 1. Inter-Rater Agreement for Content Appraisal Features

Completeness (between evaluator 1 and 2)

Kappa value = 0.768

Table 7 Evaluator 1 & 2 on Completeness

Completeness (between evaluator 1 and 3)

Kappa value = 0.707

Table 8 Evaluator 1 & 3 on Completeness

Completeness (between evaluator 2 and 3)

Kappa value = 0.781

Table 9 Evaluator 2 & 3 on Completeness

Complexity (between evaluator 1 and 2)

Kappa value = 0.796

Table 10 Evaluator 1 & 2 on Complexity

Complexity (between evaluator 1 and 3)

Kappa value = 0.726

Table 11 Evaluator 1 & 3 on Complexity

Complexity (between evaluator 2 and 3)

Kappa value = 0.836

Table 12 Evaluator 2 & 3 on Complexity

Language error (between evaluator 1 and 2)

Kappa value = 0.703

Table 13 Evaluator 1 & 2 on Language error

Language error (between evaluator 1 and 3)

Kappa value = 0.780

Table 14 Evaluator 1 & 3 on Language error

Language error (between evaluator 2 and 3)

Kappa value = 0.749

Table 15 Evaluator 2 & 3 on Language error

Presentation (between evaluator 1 and 2)

Kappa value = 0.729

Table 16 Evaluator 1 & 2 on Presentation

Presentation (between evaluator 1 and 3)

Kappa value = 0.703

Table 17 Evaluator 1 & 3 on Presentation

Presentation (between evaluator 2 and 3)

Kappa value = 0.858

Table 18 Evaluator 2 & 3 on Presentation

Politeness (between evaluator 1 and 2)

Kappa value = 0.752

Table 19 Evaluator 1 & 2 on Politeness

Politeness (between evaluator 1 and 3)

Kappa value = 0.696

Table 20 Evaluator 1 & 3 on Politeness

Politeness (between evaluator 2 and 3)

Kappa value = 0.806

Table 21 Evaluator 2 & 3 on Politeness

Subjectivity (between evaluator 1 and 2)

Kappa value = 0.778

Table 22 Evaluator 1 & 2 on Subjectivity

Subjectivity (between evaluator 1 and 3)

Kappa value = 0.751

Table 23 Evaluator 1 & 3 on Subjectivity

Subjectivity (between evaluator 2 and 3)

Kappa value = 0.848

Table 24 Evaluator 2 & 3 on Subjectivity

Overall average kappa value = 0.765

Table 25 Overall average Cohen’s kappa coefficient for the evaluation of content appraisal features
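The pairwise kappa values reported above can in principle be reproduced with a small pure-Python implementation of Cohen's kappa. The rater labels below are toy data invented for illustration, not the study's actual content-appraisal ratings.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement between two raters, corrected
    for the agreement expected by chance from each rater's label frequencies."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy ratings (hypothetical), e.g. 1 = "complete", 0 = "incomplete".
r1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
r2 = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]
print(round(cohen_kappa(r1, r2), 3))  # 0.8
```

By the Landis and Koch convention referenced in the chapter, values in the 0.61-0.80 range indicate substantial agreement, which is where most of the pairwise values above fall.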

Appendix 2. Accuracy and AUC from Tenfold Cross-Validation

Table 26 Accuracy and AUC from tenfold cross-validation for logistic regression
Table 27 Accuracy and AUC from tenfold cross-validation for SVM
Table 28 Accuracy and AUC from tenfold cross-validation for decision tree
Table 29 Accuracy and AUC from tenfold cross-validation for naïve Bayes
Table 30 Accuracy and AUC from tenfold cross-validation for k-NN
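As a minimal sketch of how the accuracy and AUC figures in Tables 26-30 are typically produced, the following pure-Python code generates tenfold cross-validation splits and computes AUC as the probability that a randomly chosen positive instance outranks a randomly chosen negative one (ties counted as half). The labels and scores are toy values, not the study's data; the chapter's own experiments point to scikit-learn (note 7).

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def auc_score(labels, scores):
    """AUC as P(score of random positive > score of random negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy scores from a hypothetical classifier; 1 = answered, 0 = ignored.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
print(round(auc_score(labels, scores), 3))  # 0.833
```

In tenfold cross-validation, a classifier is trained on each `train` split and scored on the corresponding `test` split, and the accuracy/AUC values in the tables are averages over the ten folds.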

Appendix 3. ROC Curves from Tenfold Cross-Validation

Fig. 10
figure a

ROC curves for logistic regression

Fig. 11
figure b

ROC curves for SVM

Fig. 12
figure c

ROC curves for decision tree

Fig. 13
figure d

ROC curves for naïve Bayes

Fig. 14
figure e

ROC curves for k-NN
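ROC curves like those in Figs. 10-14 can be reproduced in outline by sweeping a decision threshold downwards over the classifier's scores: each instance, taken in order of decreasing score, moves the operating point up (a true positive) or right (a false positive). The sketch below uses hypothetical toy values, not the study's outputs, and ignores score ties for simplicity.

```python
def roc_points(labels, scores):
    """ROC operating points (FPR, TPR), sweeping the threshold downwards."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Each instance moves the point up (positive) or right (negative).
    for y, _ in sorted(zip(labels, scores), key=lambda t: -t[1]):
        fpr, tpr = points[-1]
        points.append((fpr + (1 - y) / neg, tpr + y / pos))
    return points

# Toy labels and scores (hypothetical classifier outputs).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
for fpr, tpr in roc_points(labels, scores):
    print(round(fpr, 2), round(tpr, 2))
```

The area under the resulting step curve is the AUC reported in Appendix 2, and plotting one curve per fold gives figures of the kind shown above.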


Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Zhou, S., Fong, S. (2016). Exploring the Feature Selection-Based Data Analytics Solutions for Text Mining Online Communities by Investigating the Influential Factors: A Case Study of Programming CQA in Stack Overflow. In: Hung, P. (eds) Big Data Applications and Use Cases. International Series on Computer Entertainment and Media Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-30146-4_4


  • DOI: https://doi.org/10.1007/978-3-319-30146-4_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30144-0

  • Online ISBN: 978-3-319-30146-4

  • eBook Packages: Computer Science (R0)
