Fast and Effective Clustering of Spam Emails Based on Structural Similarity

Sheikhalishahi, Mina; Saracino, Andrea; Mejri, Mohamed; Tawbi, Nadia; Martinelli, Fabio

doi:10.1007/978-3-319-30303-1_12

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 9482)

Included in the following conference series:

International Symposium on Foundations and Practice of Security

Abstract

Every year, spam emails impose extremely heavy costs in terms of time, storage space, and money on both private users and companies. Finding and prosecuting spammers and any other stakeholders behind spam emails would make it possible to tackle the problem at its root. Such an analysis is difficult because it must be performed on large amounts of unclassified raw emails. To facilitate it, in this paper we propose a framework that quickly and effectively divides large amounts of spam emails into homogeneous campaigns through structural similarity. The framework exploits a set of 21 features representative of the email structure and a novel categorical clustering algorithm named Categorical Clustering Tree (CCTree). The methodology is evaluated and validated through standard tests performed on three datasets amounting to more than 200k real, recent spam emails.
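The abstract does not spell out how CCTree builds its tree, so the following is only a rough illustration of the general idea of clustering categorical records with a tree. It assumes (this is not stated above) that a node is split on the attribute whose values have the highest Shannon entropy, and that each leaf is read as one cluster. The function names, the stopping rule, and the three toy features are all hypothetical and stand in for the paper's 21 structural features.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a non-empty list of categorical values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def split_tree(rows, attributes, min_size=2, max_entropy=0.0):
    """Recursively partition rows (dicts) into leaves; each leaf is one cluster.

    Assumed stopping rule (illustrative only): stop when the node is small
    or when no remaining attribute varies within the node.
    """
    if not attributes or len(rows) <= min_size:
        return [rows]
    # Pick the attribute whose values are most mixed inside this node.
    scores = {a: entropy([r[a] for r in rows]) for a in attributes}
    best = max(scores, key=scores.get)
    if scores[best] <= max_entropy:
        return [rows]
    # One child node per distinct value of the chosen attribute.
    groups = defaultdict(list)
    for r in rows:
        groups[r[best]].append(r)
    remaining = [a for a in attributes if a != best]
    leaves = []
    for group in groups.values():
        leaves.extend(split_tree(group, remaining, min_size, max_entropy))
    return leaves


# Toy data: hypothetical categorical features, not the paper's feature set.
emails = [
    {"language": "en", "attachment": "no", "links": "few"},
    {"language": "en", "attachment": "no", "links": "few"},
    {"language": "en", "attachment": "no", "links": "few"},
    {"language": "fr", "attachment": "no", "links": "many"},
    {"language": "fr", "attachment": "no", "links": "many"},
    {"language": "fr", "attachment": "no", "links": "many"},
]
clusters = split_tree(emails, ["language", "attachment", "links"])
print(sorted(len(c) for c in clusters))
```

On this toy input the six records fall into two leaves of three structurally identical emails each. Because a split only groups records by attribute value, each level of the tree costs one pass over the node's records, which is one plausible reason a tree-based scheme can be fast on large collections.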

This research has been partially supported by EU Seventh Framework Programme (FP7/2007–2013) under grant no 610853 (COCO Cloud), MIUR-PRIN Security Horizons and Natural Sciences and Engineering Research Council of Canada (NSERC).

Notes

1. http://www.cavar.me/damir/LID/.
2. http://mathworks.com.
3. Available at: http://security.iit.cnr.it/images/Mails/cctreesamples.zip.

Author information

Authors and Affiliations

Department of Computer Science, Université Laval, Quebec City, Canada
Mina Sheikhalishahi, Mohamed Mejri & Nadia Tawbi
Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, Italy
Andrea Saracino & Fabio Martinelli

Corresponding author

Correspondence to Andrea Saracino.

Editor information

Editors and Affiliations

Télécom SudParis, Evry, France
Joaquin Garcia-Alfaro
School of Computer Science, Carleton University, Ottawa, Ontario, Canada
Evangelos Kranakis
École des Mines de Nancy, Université de Lorraine, Nancy Cedex, France
Guillaume Bonfante

About this paper

Cite this paper

Sheikhalishahi, M., Saracino, A., Mejri, M., Tawbi, N., Martinelli, F. (2016). Fast and Effective Clustering of Spam Emails Based on Structural Similarity. In: Garcia-Alfaro, J., Kranakis, E., Bonfante, G. (eds) Foundations and Practice of Security. FPS 2015. Lecture Notes in Computer Science, vol 9482. Springer, Cham. https://doi.org/10.1007/978-3-319-30303-1_12

DOI: https://doi.org/10.1007/978-3-319-30303-1_12
Published: 25 February 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30302-4
Online ISBN: 978-3-319-30303-1
eBook Packages: Computer Science, Computer Science (R0)
