Abstract
Source code authorship attribution is the task of determining who wrote a piece of software, given its source code, when the author is not explicitly known. Identifying an unknown author is necessary in numerous scenarios, including software forensics investigations, plagiarism detection, and questions of software ownership. A number of methods for authorship attribution of source code have been presented in the past, including two state-of-the-art methods: SCAP and Burrows. Each of these two methods was individually improved and, as presented in this paper, an ensemble method based on the Bayes optimal classifier was developed from them. An empirical study was performed using a data set of 7,231 open-source and textbook programs written in C++ and Java by thirty unique authors. The ensemble method successfully attributed 98.2% of all documents in the data set, compared to 88.9% for the Burrows baseline method and 91.0% for the SCAP baseline method.
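To illustrate the general idea of a Bayes optimal ensemble, the following minimal sketch combines per-author probabilities from two base classifiers by weighting each classifier with an estimate of how much it is trusted. It is not the paper's implementation: the abstract does not say how the classifier weights or the per-author probabilities are obtained, so the function names (`normalize`, `bayes_optimal_vote`), the model labels, and all numbers below are hypothetical.

```python
from typing import Dict, Mapping, Tuple


def normalize(scores: Mapping[str, float]) -> Dict[str, float]:
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        # Fall back to a uniform distribution when no score carries weight.
        return {key: 1.0 / len(scores) for key in scores}
    return {key: value / total for key, value in scores.items()}


def bayes_optimal_vote(
    posteriors_by_model: Mapping[str, Mapping[str, float]],
    model_weights: Mapping[str, float],
) -> Tuple[str, Dict[str, float]]:
    """Pick the author maximizing sum over models of P(model) * P(author | model).

    posteriors_by_model: for each base classifier, a score per candidate
        author (assumed here to be convertible into probabilities).
    model_weights: an assumed estimate of each classifier's reliability,
        for example its accuracy on held-out data.
    """
    weights = normalize(model_weights)
    combined: Dict[str, float] = {}
    for model, posterior in posteriors_by_model.items():
        for author, probability in normalize(posterior).items():
            combined[author] = combined.get(author, 0.0) + weights[model] * probability
    best_author = max(combined, key=combined.get)
    return best_author, combined


if __name__ == "__main__":
    # Hypothetical outputs of two base classifiers for one query program.
    posteriors = {
        "scap": {"alice": 0.6, "bob": 0.4},
        "burrows": {"alice": 0.2, "bob": 0.8},
    }
    author, combined = bayes_optimal_vote(posteriors, {"scap": 0.5, "burrows": 0.5})
    print(author, combined)
```

With equal weights the two classifiers disagree and the more confident one decides the outcome. Raising the weight of one classifier shifts the decision toward its preferred author, which is the behaviour an ensemble of this kind relies on.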
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Tennyson, M.F., Mitropoulos, F.J. (2014). A Bayesian Ensemble Classifier for Source Code Authorship Attribution. In: Traina, A.J.M., Traina, C., Cordeiro, R.L.F. (eds) Similarity Search and Applications. SISAP 2014. Lecture Notes in Computer Science, vol 8821. Springer, Cham. https://doi.org/10.1007/978-3-319-11988-5_25
DOI: https://doi.org/10.1007/978-3-319-11988-5_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11987-8
Online ISBN: 978-3-319-11988-5
eBook Packages: Computer Science (R0)