Full Content Search in Malware Collections

  • Andrei Mihalca
  • Ciprian Oprişa
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11398)


This paper presents techniques for performing fast content-based searches in large malware collections. The ability to retrieve malware samples that share given content is important for malware researchers who look for previous instances of a new sample or who test new signatures. We propose a data structure that allows fast searches and can be continuously expanded with new samples. The performance and scalability of our solution are demonstrated through experiments on real-world malware.


Malware, Big data, Content search



Research supported, in part, by EC H2020 SMESEC GA #740787 and EC H2020 CIPSEC GA #700378.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Bitdefender, Cluj-Napoca, Romania
  2. Technical University of Cluj-Napoca, Cluj-Napoca, Romania
