Binary executable file similarity calculation using function matching

The Journal of Supercomputing

Abstract

Computer software is now an essential part of our lives and is used in many fields. While software brings convenience, it also causes many problems, and research is needed to defend against software plagiarism, malware-based attacks, and similar threats. Analysis techniques for binary executable files can be applied to investigate and defend against these problems. However, binary executable files are relatively hard to analyze without source code, because an executable retains only the information needed for execution; semantic information is discarded during compilation. In this paper, we propose a similarity calculation method for binary executable files based on function matching. Attributes are extracted from each function and used to match the functions of two binary files. The function matching process consists of three steps: function name matching, N-tuple matching, and a final n-gram-based matching. After function matching, the overall similarity is calculated from the similarities of the matched functions. Experimental results show that the accuracy of our binary-based similarity calculation method is comparable to that of a well-known source-code-based method called MOSS.
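The three matching steps and the final aggregation can be illustrated with a minimal Python sketch. The abstract does not specify the function attributes, the contents of the N-tuple, the n-gram size, the similarity measure, or how per-function scores are combined, so everything below is an assumption made for illustration: the `Func` record, the example tuple (basic blocks, instructions, calls), opcode trigrams, Jaccard similarity, the 0.5 threshold, greedy matching, and normalization by the larger function count. It should not be read as the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Func:
    name: str       # symbol name; empty when stripped (assumption)
    attrs: tuple    # hypothetical N-tuple, e.g. (basic blocks, instructions, calls)
    opcodes: tuple  # opcode mnemonics in address order (assumption)


def ngrams(seq, n):
    """Return the set of all length-n windows over seq."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def func_similarity(f, g, n=3):
    """Jaccard similarity of opcode n-gram sets (assumed measure)."""
    a, b = ngrams(f.opcodes, n), ngrams(g.opcodes, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_functions(fs, gs, n=3, threshold=0.5):
    """Greedy three-step matching; returns a list of (i, j) index pairs."""
    matched = []
    free_f, free_g = set(range(len(fs))), set(range(len(gs)))

    def take(i, j):
        matched.append((i, j))
        free_f.discard(i)
        free_g.discard(j)

    # Step 1: function name matching (non-empty, identical names).
    for i in sorted(free_f):
        for j in sorted(free_g):
            if fs[i].name and fs[i].name == gs[j].name:
                take(i, j)
                break

    # Step 2: N-tuple matching (identical attribute tuples).
    for i in sorted(free_f):
        for j in sorted(free_g):
            if fs[i].attrs == gs[j].attrs:
                take(i, j)
                break

    # Step 3: n-gram-based matching (best remaining candidate above threshold).
    for i in sorted(free_f):
        best = max(sorted(free_g),
                   key=lambda j: func_similarity(fs[i], gs[j], n),
                   default=None)
        if best is not None and func_similarity(fs[i], gs[best], n) >= threshold:
            take(i, best)

    return matched


def overall_similarity(fs, gs, n=3, threshold=0.5):
    """Sum of matched-pair similarities over the larger function count (assumed)."""
    if not fs or not gs:
        return 0.0
    pairs = match_functions(fs, gs, n, threshold)
    total = sum(func_similarity(fs[i], gs[j], n) for i, j in pairs)
    return total / max(len(fs), len(gs))


# Hypothetical inputs: two small "binaries" with three functions each.
A = [
    Func("main", (2, 5, 1), ("push", "mov", "call", "pop", "ret")),
    Func("", (1, 3, 0), ("mov", "add", "ret")),
    Func("", (3, 6, 0), ("push", "mov", "cmp", "jne", "add", "ret")),
]
B = [
    Func("main", (2, 5, 1), ("push", "mov", "call", "pop", "ret")),
    Func("", (1, 3, 0), ("mov", "add", "ret")),
    Func("", (3, 7, 0), ("push", "mov", "cmp", "jne", "sub", "add", "ret")),
]

print(match_functions(A, B))
print(overall_similarity(A, B))
```

In this sketch, cheap exact checks (names, then attribute tuples) run first so that the more expensive n-gram comparison is only applied to functions that remain unmatched; unmatched functions lower the overall score because the sum is divided by the larger function count.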

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIP) (No. NRF-2016R1A2B4015254).

Author information

Corresponding author

Correspondence to Eul Gyu Im.

Cite this article

Kim, T., Lee, Y.R., Kang, B. et al. Binary executable file similarity calculation using function matching. J Supercomput 75, 607–622 (2019). https://doi.org/10.1007/s11227-016-1941-2

  • DOI: https://doi.org/10.1007/s11227-016-1941-2
