Advertisement

Empirical Software Engineering

, Volume 13, Issue 6, pp 601–643 | Cite as

Empirical evaluation of clone detection using syntax suffix trees

  • Raimar Falke
  • Pierre Frenzel
  • Rainer Koschke
Article

Abstract

Reusing software through copying and pasting is a continuous plague in software development despite the fact that it creates serious maintenance problems. Various techniques have been proposed to find duplicated redundant code (also known as software clones). A recent study has compared these techniques and shown that token-based clone detection based on suffix trees is fast but yields clone candidates that are often not syntactic units. Current techniques based on abstract syntax trees—on the other hand—find syntactic clones but are considerably less efficient. This paper describes how we can make use of suffix trees to find syntactic clones in abstract syntax trees. This new approach is able to find syntactic clones in linear time and space. The paper reports the results of a large case study in which we empirically compare the new technique to other techniques using the Bellon benchmark for clone detectors. The Bellon benchmark consists of clone pairs validated by humans for eight software systems written in C or Java from different application domains. The new contributions of this paper over the conference paper are the additional analysis of Java programs, the exploration of an alternative path that uses parse trees instead of abstract syntax trees, and the investigation of the impact on recall and precision when clone analyses insist on consistent parameter renaming.

Keywords

Software clone detection Redundancy Duplication Software maintenance Software evolution Program analysis 

Notes

Acknowledgements

We would like to thank Stefan Bellon for providing us with his benchmark and ccdiml, his support in the evaluation, and for comments on this paper. We also like to thank Felix Beckwermert for his support in the evaluation and Thilo Mende for comments on this paper. We would also like to thank the anonymous reviewers for their valuable comments.

References

  1. Antoniol G, Casazza G, Penta MD, Merlo E (2001) Modeling clones evolution through time series. In: International conference on software maintenance. IEEE CS Press, pp 273–280Google Scholar
  2. Antoniol G, Villano U, Merlo E, Penta M (2002) Analyzing cloning evolution in the linux kernel. Inf Softw Technol 44(13):755–765CrossRefGoogle Scholar
  3. Bailey J, Burd E (2002) Evaluating clone detection tools for use during preventative maintenance. In: SCAMGoogle Scholar
  4. Baker BS (1992) A program for identifying duplicated code. In: Computer science and statistics 24: Proceedings of the 24th symposium on the interfaceGoogle Scholar
  5. Baker BS (1995) On finding duplication and near-duplication in large software systems. In: Working conference on reverse engineering. IEEE CS PressGoogle Scholar
  6. Baker BS (1996) Parameterized pattern matching: algorithms and applications. JCSSGoogle Scholar
  7. Baker BS (2007) Finding clones with Dup: analysis of an experiment. IEEE Trans Softw Eng 33(9):608–621 (September)CrossRefGoogle Scholar
  8. Baker BS, Giancarlo R (2002) Sparse dynamic programming for longest common subsequence from fragments. J Algorithms 42(2):231–254 (February)MATHCrossRefMathSciNetGoogle Scholar
  9. Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (1999) Measuring clone based reengineering opportunities. In: IEEE symposium on software metrics. IEEE CS Press, pp 292–303 (November)Google Scholar
  10. Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (2000) Advanced clone-analysis to support object-oriented system refactoring. In: Working conference on reverse engineering, IEEE CS Press, pp 98–107 (October)Google Scholar
  11. Baxter ID, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSMGoogle Scholar
  12. Bellon S (2002) Vergleich von Techniken zur Erkennung duplizierten Quellcodes. Master’s thesis, University of Stuttgart, GermanyGoogle Scholar
  13. Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. IEEE Trans Softw Eng 33(9):577–591 (September)CrossRefGoogle Scholar
  14. Bentley JL, Ottmann TA (1979) Algorithms for reporting and counting geometric intersections. IEEE Trans Comput C-28:643–647CrossRefGoogle Scholar
  15. Bruntink M, van Deursen A, Tourwe T, van Engelen R (2004) An evaluation of clone detection techniques for crosscutting concerns. In: International conference on software maintenance, pp 200–209Google Scholar
  16. Bruntink M, van Engelen R, Tourwe T (2005) On the use of clone detection for identifying crosscutting concern code. IEEE Trans Softw Eng 31(10):804–818CrossRefGoogle Scholar
  17. Chou A, Yang J, Chelf B, Hallem S, Engler DR (2001) An empirical study of operating system errors. In: Symposium on operating systems principles, pp 73–88Google Scholar
  18. Cordy JR (2003) Comprehending reality—practical barriers to industrial adoption of software maintenance automation. In: International workshop on program comprehension, IEEE CS PressGoogle Scholar
  19. Cordy JR, Dean TR, Synytskyy N (2004) Practical language-independent detection of near-miss clones. In: CASCON, IBM PressGoogle Scholar
  20. Di Lucca G, Di Penta M, Fasolino, A (2002) An approach to identify duplicated web pages. In: COMPSACGoogle Scholar
  21. Ducasse S, Rieger M, Demeyer S (1999) A language independent approach for detecting duplicated code. In: ICSMGoogle Scholar
  22. Fowler M (1999) Refactoring: improving the design of existing code. Addison Wesley, Boston, MA, USAGoogle Scholar
  23. Gitchell D, Tran N (1999) Sim: a utility for detecting similarity in computer programs. In: SIGCSE, ACM PressGoogle Scholar
  24. Godfrey M, Tu Q (2001) Growth, evolution and structural change in open source software. In: Workshop on principles of software evolution (September)Google Scholar
  25. Higo Y, Ueda Y, Kamiya T, Kusumoto S, Inoue K (2002) On software maintenance process improvement based on code clone analysis. In: International conference on product focused software process improvement. Lecture notes in computer science, vol 2559. SpringerGoogle Scholar
  26. Johnson JH (1993) Identifying redundancy in source code using fingerprints. In: CASCON, IBM PressGoogle Scholar
  27. Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multi-linguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670CrossRefGoogle Scholar
  28. Kapser C, Godfrey M (2003a) A taxonomy of clones in source code: the reengineers most wanted list. In: Working conference on reverse engineering. IEEE CS PressGoogle Scholar
  29. Kapser C, Godfrey MW (2003b) Toward a taxonomy of clones in source code: a case study. In: Evolution of large scale industrial software architecturesGoogle Scholar
  30. Kapser C, Godfrey M (2005) Improved tool support for the investigation of duplication in software. Proceedings of the 21st IEEE international conference on software maintenanceGoogle Scholar
  31. Kapser C, Godfrey MW (2006) “Clones considered harmful” considered harmful. In: Working conference on reverse engineeringGoogle Scholar
  32. Kim M, Bergman L, Lau T, Notkin D (2004) An ethnographic study of copy and paste programming practices in OOPL. In: International symposium on empirical software engineering. IEEE CS Press, pp 83–92Google Scholar
  33. Kim M, Sazawal V, Notkin D, Murphy GC (2005) An empirical study of code clone genealogies. In: European software engineering conference and foundations of software engineering (ESEC/FSE)Google Scholar
  34. Komondoor R, Horwitz S (2001) Using slicing to identify duplication in source code. In: Proc. int. symposium on static analysisGoogle Scholar
  35. Kontogiannis K, Mori RD, Merlo E, Galler M, Bernstein M (1996) Pattern matching for clone and concept detection. Autom Softw Eng 3(1/2):79–108Google Scholar
  36. Koschke R, Girard JF, Würthner M (1998) Intermediate representations for reverse engineering. In: Working conference on reverse engineering. IEEE CS Press, pp 241–250Google Scholar
  37. Koschke R, Falke R, Frenzel P (2006) Clone detection using abstract syntax suffix trees. In: Working conference on reverse engineering. IEEE CS Press, pp 253–262Google Scholar
  38. Krinke J (2001) Identifying similar code with program dependence graphs. In: WCREGoogle Scholar
  39. Lague B, Proulx D, Mayrand J, Merlo E, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: International conference on software maintenance, pp 314–321Google Scholar
  40. Laguë B, Proulx D, Mayrand J, Merlo EM, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: ICSMGoogle Scholar
  41. Lanubile F, Mallardo T (2003) Finding function clones in web applications. In: Conference on software maintenance and reengineeringGoogle Scholar
  42. Leitao AM (2003) Detection of redundant code using R2D2. In: Workshop source code analysis and manipulation. IEEE CS PressGoogle Scholar
  43. Li Z, Lu S, Myagmar S, Zhou Y (2004) Cp-miner: a tool for finding copy-paste and related bugs in operating system code. In: Operating system design and implementation, pp 289–302Google Scholar
  44. Li Z, Lu S, Myagmar S, Zhou Y (2006) Copy-paste and related bugs in large-scale software code. IEEE Trans Softw Eng 32(3):176–192 (March)CrossRefGoogle Scholar
  45. Manber U, Myers G (1991) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948 (October)CrossRefMathSciNetGoogle Scholar
  46. Marcus A, Maletic J (2001) Identification of high-level concept clones in source code. In: Conference on automated software engineeringGoogle Scholar
  47. Mayrand J, Leblanc C, Merlo EM (1996) Experiment on the automatic detection of function clones in a software system using metrics. In: ICSM. IEEE Computer Society PressGoogle Scholar
  48. McCreight E (1976) A space-economical suffix tree construction algorithm. J ACM 32(2):262–272CrossRefMathSciNetGoogle Scholar
  49. Monden A, Nakae D, Kamiya T, Sato S, Matsumoto K (2002) Software quality analysis by code clones in industrial legacy software. In: IEEE symposium on software metrics, pp 87–94Google Scholar
  50. Prechelt L, Malpohl G, Philippsen M (2000) Jplag: finding plagiarisms among a set of programs. Technical report, University of Karlsruhe, Department of InformaticsGoogle Scholar
  51. Rieger M (2005) Effective clone detection without language barriers. Dissertation, University of Bern, SwitzerlandGoogle Scholar
  52. Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: Proceedings of the SIGMOD, pp 76–85Google Scholar
  53. Shannon CE (1984) A mathematical theory of communication. Bell Syst Tech J 27(379–423):623–656MathSciNetGoogle Scholar
  54. Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260MATHCrossRefMathSciNetGoogle Scholar
  55. Van Rysselberghe F, Demeyer S (2004) Evaluating clone detection techniques from a refactoring perspective. In: Conference on automated software engineeringGoogle Scholar
  56. Walter V, Seipel D, von Gudenberg JW, Fischer G (2004) Clone detection in source code by frequent itemset techniques. In: Workshop source code analysis and manipulationGoogle Scholar
  57. Yang W (1991) Identifying syntactic differences between two programs. Software Pract Ex 21(7):739–755CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  1. 1.University of BremenBremenGermany

Personalised recommendations