Abstract
Source code authorship identification is a special case of document authorship identification: it aims to identify the author of a program from samples of code written by known programmers. Its main applications include detecting software intellectual property infringement, detecting malicious code, and supporting software maintenance and updates. This paper proposes an approach for constructing author profiles of programmers based on a logic model of continuous and discrete word-level n-grams, together with a multi-level context model covering operations, loops, arrays and methods. We then train a support vector machine with sequential minimal optimization to identify the authorship of source code. An advantage of these author profiles is that they capture explicit and implicit personal programming preference patterns within and between keywords, identifiers, operators, statements, methods and classes. Experimental results on programs from two open source websites show that our approach achieves high accuracy and outperforms the baseline methods.
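As a rough illustration of the n-gram features named in the abstract, the sketch below counts word-level n-grams over a tokenized code fragment. It is not the paper's implementation. The tokenizer, the function names, and the reading of "discrete" n-grams as ordered but non-adjacent token tuples inside a fixed window are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical tokenizer: identifiers/keywords, integer literals,
# and single punctuation characters become separate tokens.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_0-9]")


def tokenize(source):
    """Split a source fragment into word-level tokens."""
    return TOKEN_RE.findall(source)


def continuous_ngrams(tokens, n):
    """Count n-grams made of n adjacent tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def discrete_ngrams(tokens, n, window):
    """Count ordered n-token tuples that fit inside `window` consecutive
    tokens but are not fully adjacent (assumed meaning of "discrete")."""
    counts = Counter()
    for i in range(len(tokens)):
        stop = min(i + window, len(tokens))
        for rest in combinations(range(i + 1, stop), n - 1):
            idx = (i,) + rest
            if idx[-1] - idx[0] == n - 1:
                continue  # fully adjacent: already a continuous n-gram
            counts[tuple(tokens[j] for j in idx)] += 1
    return counts


if __name__ == "__main__":
    toks = tokenize("for (int i = 0; i < n; i++)")
    print(continuous_ngrams(toks, 2).most_common(3))
    print(discrete_ngrams(toks, 2, 3).most_common(3))
```

Counts of this kind could be assembled into one feature vector per program and passed to a support vector machine trainer. That step is left out here because the abstract does not specify the feature weighting or the kernel.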
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Nos. 61672098 and 61272361).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhang, C., Wang, S., Wu, J., Niu, Z. (2017). Authorship Identification of Source Codes. In: Chen, L., Jensen, C., Shahabi, C., Yang, X., Lian, X. (eds) Web and Big Data. APWeb-WAIM 2017. Lecture Notes in Computer Science, vol 10366. Springer, Cham. https://doi.org/10.1007/978-3-319-63579-8_22
DOI: https://doi.org/10.1007/978-3-319-63579-8_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63578-1
Online ISBN: 978-3-319-63579-8
eBook Packages: Computer Science (R0)