Fast and Effective Clustering of Spam Emails Based on Structural Similarity

  • Mina Sheikhalishahi
  • Andrea Saracino
  • Mohamed Mejri
  • Nadia Tawbi
  • Fabio Martinelli
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9482)


Every year, spam emails impose heavy costs in time, storage space, and money on both private users and companies. Finding and prosecuting spammers and any other stakeholders behind spam emails would tackle the problem at its root. Such an analysis is difficult because it must be performed on large amounts of unclassified raw emails. To facilitate it, in this paper we propose a framework that quickly and effectively divides large amounts of spam emails into homogeneous campaigns based on structural similarity. The framework exploits a set of 21 features representative of the email structure and a novel categorical clustering algorithm named Categorical Clustering Tree (CCTree). The methodology is evaluated and validated through standard tests performed on three datasets amounting to more than 200k real, recent spam emails.


Keywords: Cluster Algorithm; Shannon Entropy; Internal Evaluation; External Evaluation; Stop Condition



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Mina Sheikhalishahi (1)
  • Andrea Saracino (2), corresponding author
  • Mohamed Mejri (1)
  • Nadia Tawbi (1)
  • Fabio Martinelli (2)

  1. Department of Computer Science, Université Laval, Quebec City, Canada
  2. Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, Italy
