Information Retrieval Journal, Volume 21, Issue 1, pp 1–23

Retrieving and classifying instances of source code plagiarism

  • Debasis Ganguly
  • Gareth J. F. Jones
  • Aarón Ramírez-de-la-Cruz
  • Gabriela Ramírez-de-la-Rosa
  • Esaú Villatoro-Tello


Automatic detection of source code plagiarism is an important research field for both the commercial software industry and the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well to large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to result in many false positives being retrieved, because different programs use identical programming-language-specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) to the top-ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST-based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST-based approach identifies a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier is shown to filter the ranked list of retrieved candidate plagiarized documents effectively and thus to improve it further.
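The contrast between the BoW and AST representations described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Python's standard `ast` module applied to Python snippets (the paper works with Java files), and the field names `function_defs`, `calls` and `names`, as well as the helper functions, are hypothetical choices made only for illustration.

```python
import ast
import re
from collections import Counter
from math import sqrt


def bow_terms(source: str) -> Counter:
    """Bag-of-words: every identifier-like token, language keywords included."""
    return Counter(re.findall(r"[A-Za-z_]\w*", source))


def ast_fields(source: str) -> dict:
    """Group terms by the kind of AST node they come from.

    The three fields are hypothetical; keywords such as `def`, `if` and
    `return` never appear because they are not identifier nodes in the tree.
    """
    fields = {"function_defs": Counter(), "calls": Counter(), "names": Counter()}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            fields["function_defs"][node.name] += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            fields["calls"][node.func.id] += 1
        elif isinstance(node, ast.Name):
            # Also reached for the Name child of a call, so called
            # function names are counted here as well.
            fields["names"][node.id] += 1
    return fields


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def field_similarity(src_a: str, src_b: str) -> float:
    """Unweighted average of per-field cosine similarities (an assumed scheme)."""
    fa, fb = ast_fields(src_a), ast_fields(src_b)
    return sum(cosine(fa[f], fb[f]) for f in fa) / len(fa)


# Two unrelated programs that share only language keywords.
UNRELATED_A = "def area(w, h):\n    if w > 0:\n        return w * h\n    return 0\n"
UNRELATED_B = "def greet(name):\n    if name:\n        return len(name)\n    return 0\n"

bow_score = cosine(bow_terms(UNRELATED_A), bow_terms(UNRELATED_B))
field_score = field_similarity(UNRELATED_A, UNRELATED_B)
```

In this toy case the BoW score is positive purely because both snippets contain `def`, `if` and `return`, whereas the field-based score is zero because no identifier is shared within any field. A real system would index such fields in a retrieval engine and rank candidates per query document rather than compare pairs directly.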


Keywords: Source code plagiarism detection; Field based indexing and retrieval; Lexical, structural and stylistic features; Document representation



The authors would like to thank Enrique Flores, Paolo Rosso and Lidia Moreno for providing important details regarding the participating systems in the SOCO 2014 shared task. The first two authors are supported by Science Foundation Ireland (SFI) as a part of the ADAPT Centre at DCU (Grant No.: 13/RC/2106). The work of the last three authors was partially funded by CONACyT under the Thematic Networks program (Language Technologies Thematic Network Project No. 260178, 271622). They would also like to thank UAM Cuajimalpa and SNI-CONACyT for their support.



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland
  2. Language and Reasoning Research Group, Information Technologies Department, Universidad Autónoma Metropolitana, Cuajimalpa, México, Mexico
