EnCoD: Distinguishing Compressed and Encrypted File Fragments

De Gaspari, Fabio; Hitaj, Dorjan; Pagnotta, Giulio; De Carli, Lorenzo; Mancini, Luigi V.

doi:10.1007/978-3-030-65745-1_3


Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12570)

Included in the following conference series:

International Conference on Network and System Security


Abstract

Reliable identification of encrypted file fragments is a requirement for several security applications, including ransomware detection, digital forensics, and traffic analysis. A popular approach is to measure the entropy of a fragment and treat high entropy as a proxy for randomness, and hence for encryption. However, many modern content types (e.g., office documents and media files) are highly compressed for storage and transmission efficiency. Compression algorithms also output high-entropy data, which reduces the accuracy of entropy-based encryption detectors.

Over the years, a variety of approaches have been proposed to distinguish encrypted file fragments from high-entropy compressed fragments. However, these approaches are typically evaluated only over a few selected data types and fragment sizes, which makes a fair assessment of their practical applicability impossible. This paper aims to close this gap by comparing existing statistical tests on a large, standardized dataset. Our results show that current approaches cannot reliably tell encryption and compression apart, even for large fragment sizes. To address this issue, we design EnCoD, a learning-based classifier that can reliably distinguish compressed and encrypted data, starting with fragments as small as 512 bytes. We evaluate EnCoD against current approaches over a large dataset of different data types, showing that it outperforms the current state of the art for most considered fragment sizes and data types.



Acknowledgments

We would like to thank Daniele Venturi and Guinevere Gilman for their useful insights and comments. This work was supported by Gen4olive, a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101000427, and in part by the Italian MIUR through the Dipartimento di Informatica, Sapienza University of Rome, under Grant Dipartimenti di eccellenza 2018–2022.

Author information

Authors and Affiliations

Dipartimento di Informatica, Sapienza Università di Roma, Rome, Italy
Fabio De Gaspari, Dorjan Hitaj, Giulio Pagnotta & Luigi V. Mancini
Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA, USA
Lorenzo De Carli


Corresponding author

Correspondence to Dorjan Hitaj.

Editor information

Editors and Affiliations

Wrocław University of Technology, Wroclaw, Poland
Mirosław Kutyłowski
Swinburne University of Technology, Hawthorn, VIC, Australia
Jun Zhang
James Cook University, Douglas, QLD, Australia
Chao Chen

Appendix

A Entropy Analysis Results

Full results for the entropy analysis discussed in Sect. 2.4:

Chunk size: 512B
Format	Min	Q1	Median	Q3	Max
enc	7.427	7.569	7.591	7.613	7.709
zip	7.163	7.560	7.584	7.607	7.695
gzip	7.154	7.560	7.585	7.607	7.703
rar	7.381	7.563	7.587	7.610	7.692
jpeg	3.820	7.512	7.548	7.576	7.676
mp3	0.000	7.451	7.527	7.565	7.680
png	0.000	1.070	2.605	4.549	7.572
pdf	0.000	7.453	7.534	7.574	7.676

Chunk size: 2048B
Format	Min	Q1	Median	Q3	Max
enc	7.873	7.903	7.908	7.914	7.938
zip	7.816	7.898	7.904	7.910	7.935
gzip	7.847	7.898	7.904	7.910	7.933
rar	7.795	7.900	7.905	7.911	7.933
jpeg	5.123	7.856	7.873	7.884	7.917
mp3	0.379	7.703	7.838	7.871	7.916
png	0.000	1.312	2.815	4.752	7.808
pdf	0.000	7.820	7.875	7.893	7.930

Chunk size: 8192B
Format	Min	Q1	Median	Q3	Max
enc	7.969	7.976	7.978	7.979	7.984
zip	7.955	7.973	7.975	7.976	7.983
gzip	7.955	7.973	7.975	7.976	7.983
rar	7.960	7.974	7.976	7.977	7.983
jpeg	5.646	7.930	7.945	7.952	7.967
mp3	0.497	7.789	7.918	7.942	7.971
png	0.014	1.451	2.963	4.852	7.914
pdf	0.010	7.903	7.953	7.968	7.981
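
Per-fragment statistics of the kind tabulated above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code. It assumes that the reported values are Shannon entropies of the byte-value distribution, in bits per byte, which the 0 to 8 range suggests but this page does not state. The helper names (`byte_entropy`, `fragments`, `five_number_summary`), the use of `os.urandom` as a stand-in for ciphertext, the zlib-compressed synthetic text as a stand-in for compressed files, and the quartile method (the default of `statistics.quantiles`) are all assumptions made for the example.

```python
import math
import os
import statistics
import zlib
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    if not fragment:
        return 0.0
    n = len(fragment)
    counts = Counter(fragment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fragments(data: bytes, size: int):
    """Yield consecutive, non-overlapping fragments of exactly `size` bytes."""
    for start in range(0, len(data) - size + 1, size):
        yield data[start:start + size]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); the quartile method is an assumption."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)


if __name__ == "__main__":
    # Hypothetical inputs: random bytes stand in for encrypted data,
    # zlib-compressed synthetic text stands in for compressed data.
    random_like = os.urandom(512 * 200)
    text = " ".join(str(i) for i in range(200_000)).encode()
    compressed = zlib.compress(text, 9)

    for name, blob in [("random", random_like), ("zlib", compressed)]:
        values = [byte_entropy(f) for f in fragments(blob, 512)]
        summary = five_number_summary(values)
        print(name, [round(v, 3) for v in summary])
```

Under this assumption, a 512-byte fragment can reach 8 bits per byte only if each of the 256 byte values occurs exactly twice, so even uniformly random fragments typically score noticeably below 8. This is consistent with the gap between the 512B rows and the 8192B rows in the tables above.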

About this paper

Cite this paper

De Gaspari, F., Hitaj, D., Pagnotta, G., De Carli, L., Mancini, L.V. (2020). EnCoD: Distinguishing Compressed and Encrypted File Fragments. In: Kutyłowski, M., Zhang, J., Chen, C. (eds) Network and System Security. NSS 2020. Lecture Notes in Computer Science, vol 12570. Springer, Cham. https://doi.org/10.1007/978-3-030-65745-1_3


DOI: https://doi.org/10.1007/978-3-030-65745-1_3
Published: 19 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65744-4
Online ISBN: 978-3-030-65745-1
eBook Packages: Computer Science, Computer Science (R0)
