EnCoD: Distinguishing Compressed and Encrypted File Fragments

  • Conference paper
Network and System Security (NSS 2020)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12570)

Abstract

Reliable identification of encrypted file fragments is a requirement for several security applications, including ransomware detection, digital forensics, and traffic analysis. A popular approach estimates the entropy of a fragment and treats high values as a proxy for randomness. However, many modern content types (e.g., office documents and media files) are highly compressed for storage and transmission efficiency. Compression algorithms also output high-entropy data, which reduces the accuracy of entropy-based encryption detectors.
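The entropy proxy described above can be illustrated with a minimal sketch. It computes the byte-level Shannon entropy (in bits per byte) of a 512-byte fragment. The function name `byte_entropy`, the use of `os.urandom` as a stand-in for encrypted data, and the synthetic zlib stream as a stand-in for compressed data are all assumptions made for illustration; they are not the paper's dataset or tooling.

```python
import math
import os
import zlib
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    if not fragment:
        return 0.0
    total = len(fragment)
    counts = Counter(fragment)
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Assumption: OS random bytes stand in for an encrypted fragment.
random_fragment = os.urandom(512)

# Assumption: 512 bytes cut from the middle of a zlib stream stand in
# for a compressed fragment. The input text is synthetic.
text = " ".join(str(i * i) for i in range(20000)).encode()
compressed = zlib.compress(text, 9)
compressed_fragment = compressed[1024:1536]

print(f"random:     {byte_entropy(random_fragment):.3f}")
print(f"compressed: {byte_entropy(compressed_fragment):.3f}")
```

Both values are expected to land close to each other and well above those of plain text, which is the difficulty the abstract points at: a threshold on entropy alone leaves little room to separate the two classes.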

Over the years, a variety of approaches have been proposed to distinguish encrypted file fragments from high-entropy compressed fragments. However, these approaches are typically evaluated over only a few selected data types and fragment sizes, which makes a fair assessment of their practical applicability impossible. This paper aims to close this gap by comparing existing statistical tests on a large, standardized dataset. Our results show that current approaches cannot reliably tell encryption and compression apart, even for large fragment sizes. To address this issue, we design EnCoD, a learning-based classifier that can reliably distinguish compressed and encrypted data, starting with fragments as small as 512 bytes. We evaluate EnCoD against current approaches over a large dataset of different data types, showing that it outperforms the current state of the art for most considered fragment sizes and data types.


Acknowledgments

We would like to thank Daniele Venturi and Guinevere Gilman for their useful insights and comments. This work was supported by Gen4olive, a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101000427, and in part by the Italian MIUR through the Dipartimento di Informatica, Sapienza University of Rome, under Grant Dipartimenti di eccellenza 2018–2022.

Author information

Corresponding author

Correspondence to Dorjan Hitaj.

Appendix

A Entropy Analysis Results

Full results for the entropy analysis discussed in Sect. 2.4:

Chunk size: 512B

Format  Min    Q1     Median  Q3     Max
enc     7.427  7.569  7.591   7.613  7.709
zip     7.163  7.560  7.584   7.607  7.695
gzip    7.154  7.560  7.585   7.607  7.703
rar     7.381  7.563  7.587   7.610  7.692
jpeg    3.820  7.512  7.548   7.576  7.676
mp3     0.000  7.451  7.527   7.565  7.680
png     0.000  1.070  2.605   4.549  7.572
pdf     0.000  7.453  7.534   7.574  7.676

Chunk size: 2048B

Format  Min    Q1     Median  Q3     Max
enc     7.873  7.903  7.908   7.914  7.938
zip     7.816  7.898  7.904   7.910  7.935
gzip    7.847  7.898  7.904   7.910  7.933
rar     7.795  7.900  7.905   7.911  7.933
jpeg    5.123  7.856  7.873   7.884  7.917
mp3     0.379  7.703  7.838   7.871  7.916
png     0.000  1.312  2.815   4.752  7.808
pdf     0.000  7.820  7.875   7.893  7.930

Chunk size: 8192B

Format  Min    Q1     Median  Q3     Max
enc     7.969  7.976  7.978   7.979  7.984
zip     7.955  7.973  7.975   7.976  7.983
gzip    7.955  7.973  7.975   7.976  7.983
rar     7.960  7.974  7.976   7.977  7.983
jpeg    5.646  7.930  7.945   7.952  7.967
mp3     0.497  7.789  7.918   7.942  7.971
png     0.014  1.451  2.963   4.852  7.914
pdf     0.010  7.903  7.953   7.968  7.981
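A summary of this shape (Min, Q1, Median, Q3, Max per chunk size) could be produced along the lines of the sketch below. It splits a byte string into non-overlapping chunks, computes the byte-level Shannon entropy of each, and reports the five-number summary. The helper names, the use of `os.urandom` as sample data, and the "inclusive" quartile convention are assumptions for illustration; the source does not state how its quartiles were computed, so the numbers printed here are not expected to reproduce the tables.

```python
import math
import os
import statistics
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte."""
    if not fragment:
        return 0.0
    total = len(fragment)
    counts = Counter(fragment)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def chunk_entropies(data: bytes, chunk_size: int) -> list:
    """Entropy of each full, non-overlapping chunk; a trailing partial chunk is dropped."""
    full = len(data) - len(data) % chunk_size
    return [byte_entropy(data[i:i + chunk_size]) for i in range(0, full, chunk_size)]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    # Assumption: "inclusive" quartiles; other conventions give slightly different Q1/Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Assumption: random bytes as sample input, standing in for one file format.
sample = os.urandom(8192 * 50)

print("Chunk   Min    Q1     Median  Q3     Max")
for size in (512, 2048, 8192):
    lo, q1, med, q3, hi = five_number_summary(chunk_entropies(sample, size))
    print(f"{size:<7} {lo:.3f}  {q1:.3f}  {med:.3f}   {q3:.3f}  {hi:.3f}")
```

With random input, the reported entropy should rise toward 8 bits per byte as the chunk size grows, which mirrors the trend of the `enc` rows: a 512-byte chunk is too short to show all 256 byte values evenly, so its estimate stays visibly below 8.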

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

De Gaspari, F., Hitaj, D., Pagnotta, G., De Carli, L., Mancini, L.V. (2020). EnCoD: Distinguishing Compressed and Encrypted File Fragments. In: Kutyłowski, M., Zhang, J., Chen, C. (eds) Network and System Security. NSS 2020. Lecture Notes in Computer Science, vol 12570. Springer, Cham. https://doi.org/10.1007/978-3-030-65745-1_3

  • DOI: https://doi.org/10.1007/978-3-030-65745-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-65744-4

  • Online ISBN: 978-3-030-65745-1

  • eBook Packages: Computer Science, Computer Science (R0)
