Informatik Spektrum, Volume 42, Issue 4, pp 266–286

Same-Same But Different: On Understanding Duplicates in Stack Overflow

  • Mathias Ellmann

Abstract

Stack Overflow (SO) is one of the most popular online sites for asking and answering developers’ questions. New posts that cover exactly the same knowledge as previously posted questions get closed and deleted by the community. However, new posts that are very similar to previous questions but phrased slightly differently are kept and tagged as duplicates, since they might include additional information, hints, or keywords. In this paper, we study exact duplicates and similar duplicates in SO to gain insights into their properties and content and to understand how the community distinguishes useful from useless (i.e., to be deleted) redundant knowledge. We identified several interesting trends. Unique questions are significantly longer than others. Original questions get answered faster, receive more answers, and are viewed more frequently than exact and similar duplicates. When comparing the overlapping text in duplicate pairs, we found almost no difference between exact and similar duplicates. In both cases, about 20–25 % of the question text and 40 % of the tags are identical in an original and its duplicate. However, the answers to the duplicates seem much more diverse, with only 5–6 % repeated text.
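
To make the overlap figures more concrete, the following minimal sketch shows one way such shares could be computed for a single pair of posts. The abstract does not state which measure or tokenizer was used, so everything here is an assumption for illustration: the function names `tokens`, `overlap_share`, and `jaccard` are hypothetical, the tokenizer is a simple regular expression, and the example posts and tags are invented.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in a text (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def overlap_share(original, duplicate):
    """Share of the duplicate's distinct tokens that also occur in the original."""
    dup = tokens(duplicate)
    if not dup:
        return 0.0
    return len(dup & tokens(original)) / len(dup)


def jaccard(a, b):
    """Jaccard similarity of two collections, e.g. the tag lists of two posts."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented example pair, not taken from the study's data.
original_text = "How do I sort a list in Python?"
duplicate_text = "Sort a Python list by key"
text_share = overlap_share(original_text, duplicate_text)

original_tags = ["python", "list", "sorting"]
duplicate_tags = ["python", "sorting", "lambda"]
tag_similarity = jaccard(original_tags, duplicate_tags)
```

Here `overlap_share` is asymmetric (it is measured relative to the duplicate), while `jaccard` is symmetric; which of the two better matches the reported percentages depends on the definition used in the full paper.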

Copyright information

© Springer-Verlag GmbH Deutschland, part of Springer Nature 2019

Authors and Affiliations

  1. Hamburg, Germany