Selecting Features in Origin Analysis

  • Pam Green
  • Peter C.R. Lane
  • Austen Rainer
  • Sven-Bodo Scholz
Conference paper


When applying a machine-learning approach to develop classifiers in a new domain, an important question is what measurements to take and how they will be used to construct informative features. This paper develops a novel set of machine-learning classifiers for classifying files taken from software projects; the target classifications are based on origin analysis. Our approach adapts the output of four copy-analysis tools, generating a number of different measurements. By combining the measures and the files on which they operate, a large set of features is generated in a semi-automatic manner. Standard attribute selection and classifier training techniques then yield a pool of high-quality classifiers (accuracy in the range of 90%), together with information on the most relevant features.
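The idea of combining measures with the files they operate on, and then selecting among the resulting features, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the measure names, the file roles, and the use of absolute Pearson correlation as a simple stand-in for an attribute selection step. It is not the selection method or the feature set used in the paper.

```python
from itertools import product
from statistics import mean, pstdev

# Hypothetical names: the abstract does not list the tools' measures or file roles.
MEASURES = ["similarity", "containment"]
FILE_PAIRS = [("candidate", "older_a"), ("candidate", "older_b")]


def feature_names(measures, file_pairs):
    """Cross every measure with every file pair, giving one feature name each."""
    return [f"{m}({a},{b})" for m, (a, b) in product(measures, file_pairs)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def rank_features(rows, labels, names):
    """Rank features by absolute correlation with the class label (illustrative stand-in)."""
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), name))
    # Highest score first; ties broken alphabetically by feature name.
    return [name for _, name in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

Given a table with one row per file and one column per generated feature, `rank_features(rows, labels, feature_names(MEASURES, FILE_PAIRS))` returns the feature names ordered from most to least correlated with the label. A real pipeline would pass the top-ranked subset on to classifier training.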


Keywords: Origin Analysis, Feature Construction, Clone Detection, Code Clone, Source Code Entity (added by machine, not by the authors)





Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. University of Hertfordshire, Hatfield, UK
