Empirical Software Engineering

, Volume 24, Issue 2, pp 902–936 | Cite as

Preventing duplicate bug reports by continuously querying bug reports

  • Abram HindleEmail author
  • Curtis Onuczko


Bug deduplication or duplicate bug report detection is a hot topic in software engineering information retrieval research, but it is often not deployed. Typically to de-duplicate bug reports developers rely upon the search capabilities of the bug report software they employ, such as Bugzilla, Jira, or Github Issues. These search capabilities range from simple SQL string search to IR-based word indexing methods employed by search engines. Yet too often these searches do very little to stop the creation of duplicate bug reports. Some bug trackers have more than 10% of their bug reports marked as duplicate. Perhaps these bug tracker search engines are not enough? In this paper we propose a method of attempting to prevent duplicate bug reports before they start: continuously querying. That is as the bug reporter types in their bug report their text is used to query the bug database to find duplicate or related bug reports. This continuously querying bug reports allows the reporter to be alerted to duplicate bug reports as they report the bug, rather than formulating queries to find the duplicate bug report. Thus this work ushers in a new way of evaluating bug report deduplication techniques, as well as a new kind of bug deduplication task. We show that simple IR measures can address this problem but also that further research is needed to refine this novel process that is integrate-able into modern bug report systems.


Duplicate bug reports Issue reports Deduplication Information retrieval Just in time Continuously querying bug reports Continuous query 



This work was funded by an NSERC Discovery Grant, NSERC Engage Grant, and a MITACS Accelerate Cluster Grant in conjunction with Bioware ULC. We would also like to thank prior reviewers and Ahmed Hassan.


  1. Aggarwal K, Timbers F, Rutgers T, Hindle A, Stroulia E, Greiner R (2017) Detecting duplicate bug reports with software engineering domain knowledge. Journal of Software: Evolution and Process 29:1–15. E1821 smr.1821CrossRefGoogle Scholar
  2. Aggarwal K, Rutgers T, Timbers F, Hindle A, Greiner R, Stroulia E (2015) Detecting duplicate bug reports with software engineering domain knowledge. In: 22nd international conference on software analysis, evolution and reengineering (SANER), 2015 IEEE, pp 211–220. IEEEGoogle Scholar
  3. Alipour A (2013) A contextual approach towards more accurate duplicate bug report detection. Master’s thesis University of AlbertaGoogle Scholar
  4. Alipour A, Hindle A, Stroulia E (2013) A contextual approach towards more accurate duplicate bug report detection. In: Proceedings of the Tenth International Workshop on Mining Software Repositories, pp 183–192. IEEE PressGoogle Scholar
  5. Arasu A, Babu S, Widom J (2006) The cql continuous query language: semantic foundations and query execution. VLDB J 15(2):121–142CrossRefGoogle Scholar
  6. Asaduzzaman M, Roy CK, Schneider KA, Hou D (2014) Cscc: Simple, efficient, context sensitive code completion. In: 2014 IEEE International conference on software maintenance and evolution (ICSME), pp 71–80. IEEEGoogle Scholar
  7. Bettenburg N, Premraj R, Zimmermann T, Kim S (2008) Duplicate bug reports considered harmful really?. In: IEEE international conference on software maintenance, 2008. ICSM 2008, pp 337–345. IEEEGoogle Scholar
  8. Campbell JC, Santos EA, Hindle A (2016) The unreasonable effectiveness of traditional information retrieval in crash report deduplication. In: International Working Conference on Mining Software Repositories (MSR 2016), pp 269–280.
  9. Chandrasekaran S, Cooper O, Deshpande A, Franklin MJ, Hellerstein JM, Hong W, Krishnamurthy S, Madden SR, Reiss F, Shah MA (2003) Telegraphcq: Continuous dataflow processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03. ACM, New York, pp 668–668.
  10. Chandrasekaran S, Franklin MJ (2002) Streaming queries over streaming data. In: Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02. VLDB Endowment, pp 203–214.
  11. Deshmukh JMAK, Podder S, Sengupta S, Dubash N (2017) Towards accurate duplicate bug retrieval using deep learning techniques. In: 2017 IEEE International conference on software maintenance and evolution (ICSME), pp 115–124.
  12. Google (2016) Google suggestion service
  13. Haiduc S (2014) Supporting query formulation for text retrieval applications in software engineering. In: 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, September 29 - October 3, 2014, pp 657–662. IEEE Computer Society.
  14. Harman M, Mansouri SA, Zhang Y (2012) Search-based software engineering: Trends, techniques and applications. ACM Comput Surv 45(1):11:1–11:61. CrossRefGoogle Scholar
  15. Jalbert N, Weimer W (2008) Automated duplicate detection for bug tracking systems. In: IEEE International Conference on dependable systems and networks with FTCS and DCC, 2008. DSN 2008, pp 52–61. IEEEGoogle Scholar
  16. Kao B, Garcia-Molina H (1994) An overview of real-time database systems. In: Real time computing, pp 261–282. SpringerGoogle Scholar
  17. Klein N, Corley CS, Kraft NA (2014) New features for duplicate bug detection. In: MSR, pp 324–327Google Scholar
  18. Lazar A, Ritchey S, Sharif B (2014) Improving the accuracy of duplicate bug report detection using textual similarity measures. In: Proceedings of the 11th Working Conference on Mining Software Repositories, pp 308–311. ACMGoogle Scholar
  19. Lukins SK, Kraft NA, Etzkorn LH (2008) Source code retrieval for bug localization using latent dirichlet allocation. In: Proceedings of the 2008 15th Working Conference on Reverse Engineering, WCRE ’08. IEEE Computer Society, Washington, pp 155–164.
  20. Manning CD, Schütze H (1999) Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge. zbMATHGoogle Scholar
  21. Nguyen AT, Nguyen TT, Nguyen TN, Lo D, Sun C (2012) Duplicate bug report detection with a combination of information retrieval and topic modeling. In: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pp 70–79. ACMGoogle Scholar
  22. Panichella A, Dit B, Oliveto R, Penta MD, Poshyvanyk D, Lucia AD (2016) Parameterizing and assembling ir-based solutions for SE tasks using genetic algorithms. In: IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016, pp 314–325. IEEE Computer Society.
  23. Ponzanelli L, Bacchelli A, Lanza M (2013) Seahawk: Stack overflow in the ide. In: Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13. IEEE Press, Piscataway, pp 1295–1298.
  24. Ponzanelli L, Bavota G, Di Penta M, Oliveto R, Lanza M (2014) Mining stackoverflow to turn the ide into a self-confident programming prompter. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014. ACM, New York, pp 102–111.
  25. Porter M (1980) An algorithm for suffix stripping. Program 14(3):130–137. CrossRefGoogle Scholar
  26. Rakha MS, Bezemer CP, Hassan AE (2017) Revisiting the performance evaluation of automated approaches for the retrieval of duplicate issue reports. IEEE Trans Softw Eng PP(99):1–1. Google Scholar
  27. Rakha MS, Bezemer CP, Hassan AE (2018) Revisiting the performance of automated approaches for the retrieval of duplicate reports in issue tracking systems that perform just-in-time duplicate retrieval Empirical Software EngineeringGoogle Scholar
  28. Rakha MS, Shang W, Hassan AE (2015) Studying the needed effort for identifying duplicate issues. Empirical Software Engineering pp 1–30.
  29. Řehůřek R, Sojka P (2010) Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, pp 45–50.
  30. Řehůřek R, Sojka P (2018) models.tfidfmodel — TF-IDF model. (retrieved, March 2018)
  31. Robertson S, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M (1995) Okapi at trec–3. In: Overview of the Third Text REtrieval Conference (TREC–3), pp 109–126. Gaithersburg, MD: NIST.
  32. Rocha H, De Oliveira G, Marques-Neto H, Valente MT (2015) Nextbug: a bugzilla extension for recommending similar bugs. Journal of Software Engineering Research and Development 3(1):1–14CrossRefGoogle Scholar
  33. Runeson P, Alexandersson M, Nyholm O (2007) Detection of duplicate defect reports using natural language processing. In: 29th international conference on Software engineering, 2007. ICSE 2007, pp 499–510. IEEEGoogle Scholar
  34. Sabor KK, Hamou-Lhadj A, Larsson A (2017) Durfex: a feature extraction technique for efficient detection of duplicate bug reports. In: 2017 IEEE international conference on software quality, reliability and security (QRS), pp 240–250. IEEEGoogle Scholar
  35. Shah MA, Hellerstein JM, Chandrasekaran S, Franklin MJ (2003) Flux: an adaptive partitioning operator for continuous query systems. In: Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405), pp 25–36.
  36. Sun C, Lo D, Khoo SC, Jiang J (2011) Towards more accurate retrieval of duplicate bug reports. In: Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pp 253–262. IEEE Computer SocietyGoogle Scholar
  37. Sun C, Lo D, Wang X, Jiang J, Khoo SC (2010) A discriminative model approach for accurate duplicate bug report retrieval. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, pp 45–54. ACMGoogle Scholar
  38. Sureka A, Jalote P (2010) Detecting duplicate bug report using character n-gram-based features. In: Software engineering conference (APSEC), 2010 17th asia pacific, pp 366–374. IEEEGoogle Scholar
  39. Tange O (2011) Gnu parallel - the command-line power tool. ;login: The USENIX Magazine 36(1), pp 42–47.
  40. Thung F, Kochhar PS, Lo D (2014) Dupfinder: Integrated tool support for duplicate bug report detection. In: Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ASE ’14. ACM, New York, pp 871–874
  41. Wang S, Lo D, Lawall J (2014) Compositional vector space models for improved bug localization. In: 2014 IEEE international conference on software maintenance and evolution (ICSME), pp 171–180. IEEEGoogle Scholar
  42. Wang X, Zhang L, Xie T, Anvik J, Sun J (2008) An approach to detecting duplicate bug reports using natural language and execution information. In: Proceedings of the 30th international conference on Software engineering, pp 461–470. ACMGoogle Scholar
  43. White RW, Marchionini G (2007) Examining the effectiveness of real-time query expansion. Inf Process Manag 43(3):685–704. Special Issue on Heterogeneous and Distributed IRCrossRefGoogle Scholar
  44. Zhang Y, Lo D, Xia X, Sun JL (2015) Multi-factor duplicate question detection in stack overflow. J Comput Sci Technol 30(5):981–997. CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Department of Computing ScienceUniversity of AlbertaEdmontonCanada
  2. 2.BioWare ULCEdmontonCanada

Personalised recommendations