Abstract
Robust, static disassembly is an important part of achieving high coverage for many binary code analyses, such as reverse engineering, malware analysis, reference monitor in-lining, and software fault isolation. However, one of the major difficulties current disassemblers face is differentiating code from data when they are interleaved. This paper presents a machine learning-based disassembly algorithm that segments an x86 binary into subsequences of bytes and then classifies each subsequence as code or data. The algorithm builds a language model from a set of pre-tagged binaries using a statistical data compression technique. It sequentially scans a new binary executable and sets a breaking point at each potential code-to-code and code-to-data/data-to-code transition. The classification of each segment as code or data is based on the minimum cross-entropy. Experimental results are presented to demonstrate the effectiveness of the algorithm.
This material is based upon work supported by the AFOSR under contracts FA9550-08-1-0044, FA9550-10-1-0088, and FA9550-08-1-0265; by the NIH through grant number 1R01LM009989; and by the NSF through grant numbers Career-0845803, CNS-0964350, and CNS-1016343.
Chapter PDF
Similar content being viewed by others
References
Balakrishnan, G., Gruian, R., Reps, T., Teitelbaum, T.: CodeSurfer/x86—a platform for analyzing x86 executables. In: Proceedings of the 14th International Conference on Compiler Construction (CC), pp. 250–254 (2005)
Bratko, A., Cormack, G.V., Filipič, B., Lynam, T.R., Zupan, B.: Spam filtering using statistical data compression models. Journal of Machine Learning Research 7, 2673–2698 (2006)
Cleary, J.G., Teahan, W.J.: Unbounded length contexts for PPM. The Computer Journal 40(2/3), 67–75 (1997)
Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications 32(4), 396–402 (1984)
Cormack, G.V., Horspool, R.N.: Data compression using dynamic Markov modeling. The Computer Journal 30(6), 541–550 (1987)
Eagle, C.: The IDA Pro Book: The Unofficial Guide to the World’s Most Popular Disassembler. No Starch Press, Inc., San Francisco (2008)
Erdélyi, G.: IDAPython: User scripting for a complex application. Bachelor’s thesis, EVTEK University of Applied Sciences (2008)
Hamlen, K.W., Mohan, V., Wartell, R.: Reining in Windows API abuses with in-lined reference monitors. Tech. Rep. UTDCS-18-10, The University of Texas at Dallas, Richardson, Texas (June 2010)
Hex-Rays: The IDA Pro disassembler and debugger (2011), www.hex-rays.com/idapro
Hunt, G., Brubacher, D.: Detours: Binary interception of Win32 functions. In: Proceedings of the 3rd USENIX Windows NT Symposium (WINSYM), pp. 14–21 (1999)
Intel: Intel® 64 and IA-32 Architectures Software Developer’s Manual, vol. 2A & 2B: Instruction Set Reference. Intel Corporation (2011)
Kinder, J., Veith, H.: Jakstab: A static analysis platform for binaries. In: Gupta, A., Malik, S. (eds.) CAV 2008. LNCS, vol. 5123, pp. 423–427. Springer, Heidelberg (2008)
Kinder, J., Zuleger, F., Veith, H.: An abstract interpretation-based framework for control flow reconstruction from binaries. In: Jones, N.D., Müller-Olm, M. (eds.) VMCAI 2009. LNCS, vol. 5403, pp. 214–228. Springer, Heidelberg (2009)
Moffat, A., Turpin, A.: Compression and Coding Algorithms. Kluwer Academic Publishers, Boston (2002)
Teahan, W.J., Wen, Y., McNab, R.J., Witten, I.H.: A compression-based algorithm for Chinese word segmentation. Computational Linguistics 26(3), 375–393 (2000)
Teahan, W.J.: Text classification and segmentation using minimum cross-entropy. In: Proceedings of the 6th International Conference on Computer-Assisted Information Retrieval (RIAO), pp. 943–961 (2000)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wartell, R., Zhou, Y., Hamlen, K.W., Kantarcioglu, M., Thuraisingham, B. (2011). Differentiating Code from Data in x86 Binaries. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science(), vol 6913. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23808-6_34
Download citation
DOI: https://doi.org/10.1007/978-3-642-23808-6_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23807-9
Online ISBN: 978-3-642-23808-6
eBook Packages: Computer ScienceComputer Science (R0)