On Leveraging Coding Habits for Effective Binary Authorship Attribution

  • Saed Alrabaee
  • Paria Shirani
  • Lingyu Wang
  • Mourad Debbabi
  • Aiman Hanna
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11098)


We propose BinAuthor, a novel method, and the first compiler-agnostic one, for identifying the authors of program binaries. BinAuthor first filters out unrelated functions (compiler and library functions) to isolate user-related functions, then converts these user-related functions into a canonical form to eliminate compiler and compilation effects. It then leverages a set of features based on collections of the choices authors make during coding; these features capture an author’s coding habits. Our evaluation demonstrates that BinAuthor outperforms existing methods in several respects. First, when tested on large datasets extracted from selected open-source C/C++ projects on GitHub, Google Code Jam events, and Planet Source Code contests, it successfully attributed a larger number of authors with significantly higher accuracy: around 90% when the number of authors is 1000. Second, when the code was subjected to refactoring techniques, code transformation, or processing using different compilers or compilation settings, there was no significant drop in accuracy, indicating that BinAuthor is more robust than previous methods.



The authors thank the anonymous reviewers for their valuable comments. We also appreciate the help we received from Perry Jones in implementing BinAuthor. This research is the result of a fruitful collaboration between the Security Research Center (SRC) of Concordia University, Defence Research and Development Canada (DRDC) and Google under a National Defence/NSERC Research Program.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Saed Alrabaee (1)
  • Paria Shirani (1)
  • Lingyu Wang (1)
  • Mourad Debbabi (1)
  • Aiman Hanna (1)

  1. Security Research Center, Concordia University, Montreal, Canada
