A Bayesian Ensemble Classifier for Source Code Authorship Attribution

  • Conference paper
Similarity Search and Applications (SISAP 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8821)

Abstract

Authorship attribution of source code is the task of deciding who wrote software, given its source code, when the author of the software is not explicitly known. There are numerous scenarios in which it is necessary to identify the author of a piece of software whose author is unknown, including software forensics investigations, plagiarism detection, and questions of software ownership. A number of methods for authorship attribution of source code have been presented in the past, including two state-of-the-art methods: SCAP and Burrows. Each of these two state-of-the-art methods was individually improved, and – as presented in this paper – an ensemble method was developed from them based on the Bayes optimal classifier. An empirical study was performed using a data set consisting of 7,231 open-source and textbook programs written in C++ and Java by thirty unique authors. The ensemble method successfully attributed 98.2% of all documents in the data set, compared to 88.9% by the Burrows baseline method and 91.0% by the SCAP baseline method.
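
To make the ensemble idea concrete, here is a minimal sketch of Bayes-optimal-style voting over two base classifiers. It is not the authors' implementation: the abstract does not say how SCAP and Burrows scores are turned into probabilities or how each method is weighted, so the per-author probabilities, the weights standing in for P(h | D), and the names `bayes_optimal_vote`, `posteriors`, and `weights` are all assumptions made for illustration.

```python
from typing import Dict


def bayes_optimal_vote(
    posteriors: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> str:
    """Return the author maximizing sum over h of P(author | h) * P(h | D).

    posteriors: for each base classifier h, an assumed distribution over authors.
    weights:    for each base classifier h, an assumed non-negative weight that
                stands in for P(h | D); normalized here so the weights sum to 1.
    """
    total = sum(weights.values())
    scores: Dict[str, float] = {}
    for name, dist in posteriors.items():
        w = weights[name] / total
        for author, p in dist.items():
            scores[author] = scores.get(author, 0.0) + w * p
    # Pick the author with the largest weighted probability mass.
    return max(scores, key=lambda author: scores[author])


# Hypothetical numbers, not taken from the paper.
posteriors = {
    "scap": {"alice": 0.6, "bob": 0.4},
    "burrows": {"alice": 0.2, "bob": 0.8},
}
print(bayes_optimal_vote(posteriors, {"scap": 0.5, "burrows": 0.5}))
print(bayes_optimal_vote(posteriors, {"scap": 0.9, "burrows": 0.1}))
```

With equal weights the confident Burrows-style vote for "bob" dominates; when the SCAP-style classifier is trusted far more, the decision flips to "alice". This is the behaviour a weighted ensemble is meant to provide: each base method contributes in proportion to how much it is believed.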

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Tennyson, M.F., Mitropoulos, F.J. (2014). A Bayesian Ensemble Classifier for Source Code Authorship Attribution. In: Traina, A.J.M., Traina, C., Cordeiro, R.L.F. (eds) Similarity Search and Applications. SISAP 2014. Lecture Notes in Computer Science, vol 8821. Springer, Cham. https://doi.org/10.1007/978-3-319-11988-5_25

  • DOI: https://doi.org/10.1007/978-3-319-11988-5_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11987-8

  • Online ISBN: 978-3-319-11988-5

  • eBook Packages: Computer Science (R0)
