Fast and Effective Clustering of Spam Emails Based on Structural Similarity

  • Conference paper
Foundations and Practice of Security (FPS 2015)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 9482)

Abstract

Every year, spam emails impose extremely heavy costs in terms of time, storage space and money on both private users and companies. Finding and prosecuting spammers, and any other stakeholders behind spam emails, would tackle the problem at its root. To facilitate this difficult analysis, which must be performed on large amounts of unclassified raw emails, this paper proposes a framework that quickly and effectively divides large volumes of spam emails into homogeneous campaigns on the basis of structural similarity. The framework exploits a set of 21 features representative of the email structure and a novel categorical clustering algorithm named Categorical Clustering Tree (CCTree). The methodology is evaluated and validated through standard tests performed on three datasets totalling more than 200k real, recent spam emails.
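To make the idea of grouping emails by categorical structural features concrete, the sketch below shows one generic way a tree-shaped categorical clustering can work: it repeatedly splits the data on the feature whose values are most mixed, measured by Shannon entropy. This is an illustration under stated assumptions, not the CCTree algorithm as defined in the paper, whose details the abstract does not give. The feature names (`lang`, `has_links`, `attachment`), the stopping parameters `min_size` and `max_entropy`, and the sample records are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def cluster(records, features, min_size=2, max_entropy=0.5):
    """Recursively split records on the highest-entropy feature.

    Assumed stopping rule (hypothetical): a node becomes a leaf, i.e. one
    cluster, when it holds at most `min_size` records or when even its most
    mixed feature has entropy at most `max_entropy`.
    """
    scores = {f: entropy([r[f] for r in records]) for f in features}
    split = max(scores, key=scores.get)
    # Clamp the threshold at 0 so a single-valued node is never split again.
    if len(records) <= min_size or scores[split] <= max(max_entropy, 0.0):
        return [records]
    groups = {}
    for r in records:
        groups.setdefault(r[split], []).append(r)
    clusters = []
    for group in groups.values():
        clusters.extend(cluster(group, features, min_size, max_entropy))
    return clusters


# Hypothetical records: each email is reduced to a few categorical features.
emails = [
    {"lang": "en", "has_links": "yes", "attachment": "none"},
    {"lang": "en", "has_links": "yes", "attachment": "none"},
    {"lang": "en", "has_links": "yes", "attachment": "none"},
    {"lang": "ru", "has_links": "no", "attachment": "zip"},
    {"lang": "ru", "has_links": "no", "attachment": "zip"},
    {"lang": "ru", "has_links": "no", "attachment": "zip"},
]
campaigns = cluster(emails, ["lang", "has_links", "attachment"])
for campaign in campaigns:
    print(len(campaign), campaign[0])
```

On this toy input every feature is evenly mixed at the root, so the first split separates the records into two groups whose members are structurally identical, and each group becomes one candidate campaign. A tree of this kind needs no pairwise distance computations, which is one plausible reason a categorical, tree-based approach would suit large volumes of email.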

This research has been partially supported by EU Seventh Framework Programme (FP7/2007–2013) under grant no 610853 (COCO Cloud), MIUR-PRIN Security Horizons and Natural Sciences and Engineering Research Council of Canada (NSERC).

Notes

  1. http://www.cavar.me/damir/LID/.

  2. http://mathworks.com.

  3. Available at: http://security.iit.cnr.it/images/Mails/cctreesamples.zip.

Author information

Corresponding author

Correspondence to Andrea Saracino.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Sheikhalishahi, M., Saracino, A., Mejri, M., Tawbi, N., Martinelli, F. (2016). Fast and Effective Clustering of Spam Emails Based on Structural Similarity. In: Garcia-Alfaro, J., Kranakis, E., Bonfante, G. (eds) Foundations and Practice of Security. FPS 2015. Lecture Notes in Computer Science, vol 9482. Springer, Cham. https://doi.org/10.1007/978-3-319-30303-1_12

  • DOI: https://doi.org/10.1007/978-3-319-30303-1_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30302-4

  • Online ISBN: 978-3-319-30303-1

  • eBook Packages: Computer Science, Computer Science (R0)
