Code Type Revealing Using Experiments Framework

  • Rami Sharon
  • Ehud Gudes
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7371)


Identifying the type of a piece of code, whether in a file or in a byte stream, is a challenge that many software companies face. Many applications, security-related and otherwise, base their behavior on the type of code they receive as input.

Today’s traditional identification methods rely on file extensions, magic numbers, proprietary headers and trailers, or type-specific identification rules. All of these are vulnerable to content tampering, and discovering such tampering requires long and tedious hours of work by professionals. This study aims to find a method of identifying the best settings for automatically creating type signatures that effectively overcome the content manipulation problem.

In this paper we lay out a framework for creating type signatures based on byte N-Grams. The framework allows setting various parameters such as N-Gram sizes and windows, selecting statistical tests, and defining rules for score calculations. The framework serves as a test lab for finding the parameters that satisfy a predefined threshold of type identification accuracy. We demonstrate the framework using basic settings that achieved an F-Measure success rate of 0.996 on 1400 test files.
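To make the idea of a byte N-Gram type signature concrete, the following is a minimal sketch, not the authors' framework. It assumes a fixed N-Gram size, relative frequencies as the statistic, and cosine similarity as the scoring rule; the abstract does not say which statistical tests or scoring rules the framework actually uses. All function names (ngram_profile, build_signature, cosine_score, classify, f_measure) and the toy training samples are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Optional


def ngram_profile(data: bytes, n: int = 2) -> Dict[bytes, float]:
    """Relative frequency of each byte n-gram found in data."""
    total = len(data) - n + 1
    if total <= 0:
        return {}
    counts = Counter(data[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def build_signature(samples: List[bytes], n: int = 2) -> Dict[bytes, float]:
    """Average the n-gram profiles of training samples of one type."""
    signature: Counter = Counter()
    for sample in samples:
        for gram, freq in ngram_profile(sample, n).items():
            signature[gram] += freq / len(samples)
    return dict(signature)


def cosine_score(profile: Dict[bytes, float],
                 signature: Dict[bytes, float]) -> float:
    """Cosine similarity between two sparse n-gram frequency vectors."""
    dot = sum(freq * signature.get(gram, 0.0) for gram, freq in profile.items())
    norm_p = sqrt(sum(f * f for f in profile.values()))
    norm_s = sqrt(sum(f * f for f in signature.values()))
    if norm_p == 0.0 or norm_s == 0.0:
        return 0.0
    return dot / (norm_p * norm_s)


def classify(data: bytes, signatures: Dict[str, Dict[bytes, float]],
             n: int = 2, threshold: float = 0.0) -> Optional[str]:
    """Return the best-scoring type, or None if no score exceeds threshold."""
    profile = ngram_profile(data, n)
    best_type, best_score = None, threshold
    for type_name, signature in signatures.items():
        score = cosine_score(profile, signature)
        if score > best_score:
            best_type, best_score = type_name, score
    return best_type


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-Measure (harmonic mean of precision and recall)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy training data (assumed, for illustration only).
signatures = {
    "text": build_signature([b"hello world hello", b"the quick brown fox"]),
    "binary": build_signature([bytes(range(256)), bytes(range(255, -1, -1))]),
}
predicted = classify(b"hello fox", signatures)
```

In this sketch, the N-Gram size n and the acceptance threshold stand in for the tunable parameters the abstract mentions; an experiment would vary them and keep the setting whose F-Measure on held-out files meets the required accuracy.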


File Type; Content type revealing framework; Code type; Byte N-Gram statistical analysis



Copyright information

© IFIP International Federation for Information Processing 2012

Authors and Affiliations

  • Rami Sharon (1)
  • Ehud Gudes (2)
  1. The Open University, Ra’anana, Israel
  2. Ben-Gurion University, Beer-Sheva, Israel
