Abstract
Reliable identification of encrypted file fragments is a requirement for several security applications, including ransomware detection, digital forensics, and traffic analysis. A popular approach estimates the entropy of a fragment and treats high entropy as a proxy for randomness, and therefore for encryption. However, many modern content types (e.g. office documents and media files) are highly compressed for storage and transmission efficiency. Compression algorithms also output high-entropy data, which reduces the accuracy of entropy-based encryption detectors.
Over the years, a variety of approaches have been proposed to distinguish encrypted file fragments from high-entropy compressed fragments. However, these approaches are typically evaluated on only a few selected data types and fragment sizes, which makes a fair assessment of their practical applicability impossible. This paper aims to close this gap by comparing existing statistical tests on a large, standardized dataset. Our results show that current approaches cannot reliably distinguish encryption from compression, even for large fragment sizes. To address this issue, we design EnCoD, a learning-based classifier that can reliably distinguish compressed and encrypted data, starting with fragments as small as 512 bytes. We evaluate EnCoD against current approaches over a large dataset of different data types, showing that it outperforms the current state of the art for most considered fragment sizes and data types.
Acknowledgments
We would like to thank Daniele Venturi and Guinevere Gilman for their useful insights and comments. This work was supported by Gen4olive, a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101000427, and in part by the Italian MIUR through the Dipartimento di Informatica, Sapienza University of Rome, under Grant Dipartimenti di eccellenza 2018–2022.
Appendix
A Entropy Analysis Results
Full results for the entropy analysis discussed in Sect. 2.4:
Chunk size: 512B

| Format | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|
| enc | 7.427 | 7.569 | 7.591 | 7.613 | 7.709 |
| zip | 7.163 | 7.560 | 7.584 | 7.607 | 7.695 |
| gzip | 7.154 | 7.560 | 7.585 | 7.607 | 7.703 |
| rar | 7.381 | 7.563 | 7.587 | 7.610 | 7.692 |
| jpeg | 3.820 | 7.512 | 7.548 | 7.576 | 7.676 |
| mp3 | 0.000 | 7.451 | 7.527 | 7.565 | 7.680 |
| png | 0.000 | 1.070 | 2.605 | 4.549 | 7.572 |
|  | 0.000 | 7.453 | 7.534 | 7.574 | 7.676 |

Chunk size: 2048B

| Format | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|
| enc | 7.873 | 7.903 | 7.908 | 7.914 | 7.938 |
| zip | 7.816 | 7.898 | 7.904 | 7.910 | 7.935 |
| gzip | 7.847 | 7.898 | 7.904 | 7.910 | 7.933 |
| rar | 7.795 | 7.900 | 7.905 | 7.911 | 7.933 |
| jpeg | 5.123 | 7.856 | 7.873 | 7.884 | 7.917 |
| mp3 | 0.379 | 7.703 | 7.838 | 7.871 | 7.916 |
| png | 0.000 | 1.312 | 2.815 | 4.752 | 7.808 |
|  | 0.000 | 7.820 | 7.875 | 7.893 | 7.930 |

Chunk size: 8192B

| Format | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|
| enc | 7.969 | 7.976 | 7.978 | 7.979 | 7.984 |
| zip | 7.955 | 7.973 | 7.975 | 7.976 | 7.983 |
| gzip | 7.955 | 7.973 | 7.975 | 7.976 | 7.983 |
| rar | 7.960 | 7.974 | 7.976 | 7.977 | 7.983 |
| jpeg | 5.646 | 7.930 | 7.945 | 7.952 | 7.967 |
| mp3 | 0.497 | 7.789 | 7.918 | 7.942 | 7.971 |
| png | 0.014 | 1.451 | 2.963 | 4.852 | 7.914 |
|  | 0.010 | 7.903 | 7.953 | 7.968 | 7.981 |
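
The per-format summaries above can be approximated in outline with the following minimal sketch. It is not the authors' code. It assumes the tabulated values are byte-level Shannon entropy (in bits per byte, so at most 8) computed independently on each fixed-size chunk. It also uses `os.urandom` as a stand-in for encrypted data and `zlib` output as an example of compressed data, rather than the paper's dataset. The helper names `byte_entropy`, `chunks`, and `five_number_summary` are hypothetical, and the quartile method (`inclusive`) is an arbitrary choice, since the source does not state which one was used.

```python
import math
import os
import statistics
import zlib
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Empirical Shannon entropy of a fragment, in bits per byte (0 to 8)."""
    if not fragment:
        return 0.0
    n = len(fragment)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum(c / n * math.log2(n / c) for c in Counter(fragment).values())


def chunks(data: bytes, size: int):
    """Yield consecutive full-size chunks, dropping a short trailing chunk."""
    for start in range(0, len(data) - size + 1, size):
        yield data[start:start + size]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


if __name__ == "__main__":
    size = 512  # chunk size in bytes (assumed to match the first table)
    samples = {
        # Assumption: OS random bytes stand in for ciphertext.
        "random": os.urandom(256 * size),
        # Assumption: zlib (DEFLATE) output for hex-encoded random bytes
        # stands in for a compressed file.
        "zlib": zlib.compress(os.urandom(256 * size).hex().encode()),
    }
    print(f"Chunk size: {size}B")
    for name, data in samples.items():
        entropies = [byte_entropy(c) for c in chunks(data, size)]
        row = " | ".join(f"{v:.3f}" for v in five_number_summary(entropies))
        print(f"{name} | {row}")
```

Because a chunk of n bytes can contain at most n distinct byte values, small chunks cannot reach 8 bits per byte, which is consistent with the tabulated maxima rising as the chunk size grows from 512 B to 8192 B.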
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
De Gaspari, F., Hitaj, D., Pagnotta, G., De Carli, L., Mancini, L.V. (2020). EnCoD: Distinguishing Compressed and Encrypted File Fragments. In: Kutyłowski, M., Zhang, J., Chen, C. (eds) Network and System Security. NSS 2020. Lecture Notes in Computer Science(), vol 12570. Springer, Cham. https://doi.org/10.1007/978-3-030-65745-1_3
DOI: https://doi.org/10.1007/978-3-030-65745-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65744-4
Online ISBN: 978-3-030-65745-1
eBook Packages: Computer Science, Computer Science (R0)