Investigating the Effect of Combining Text Clustering with Classification on Improving Spam Email Detection

Hassan, Doaa

doi:10.1007/978-3-319-53480-0_10


Conference paper
First Online: 23 February 2017


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 557)

Abstract

Email has become an easy and fast means of communication, which makes filtering unsolicited (spam) email an important challenge. Recent text-mining research has combined text clustering with classification to improve classification performance. In this paper, we investigate whether combining text clustering using the K-means algorithm with various supervised classification mechanisms improves the classification of emails as spam or non-spam. The two mechanisms are combined by adding extra features derived from the clustering step to the feature space used for classification. Our results show that combining K-means clustering with supervised classification in this way does not always improve classification performance. Moreover, in the cases where clustering does improve a classifier's performance, the gain in accuracy is very small and does not justify the additional time needed to build a learning model that combines both mechanisms. The experimental results are reported on the Enron-Spam datasets.
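The combination step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not say which clustering-derived features are added, so the one-hot cluster-membership encoding, the function names, and the toy term counts below are all assumptions made for illustration. The sketch runs a plain K-means on the feature vectors and appends each email's cluster membership to its original features; the augmented vectors would then be passed to any supervised classifier.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point (first one wins on ties)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the final centroids."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def augment(points, centroids):
    """Append a one-hot cluster-membership vector to each feature vector.

    Assumption: the paper's "extra features" are modelled here as
    cluster-membership indicators; the abstract does not specify them.
    """
    augmented_points = []
    for p in points:
        one_hot = [0.0] * len(centroids)
        one_hot[nearest(p, centroids)] = 1.0
        augmented_points.append(list(p) + one_hot)
    return augmented_points


# Hypothetical toy data: [count of "free", count of "meeting"] per email.
emails = [[5.0, 0.0], [4.0, 1.0], [0.0, 3.0], [1.0, 4.0]]
centroids = kmeans(emails, k=2)
augmented = augment(emails, centroids)
```

Each augmented vector keeps the original term features and gains k extra dimensions. The extra clustering pass is also where the additional model-building time mentioned in the abstract would come from.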



Author information

Authors and Affiliations

Department of Computers and Systems, National Telecommunication Institute, 5 Mahmoud El Miligy Street, 6th district-Nasr City, Cairo, Egypt
Doaa Hassan


Corresponding author

Correspondence to Doaa Hassan.

Editor information

Editors and Affiliations

Departamento de Engenharia Informática, Instituto Superior de Engenharia do Porto, Porto, Portugal
Ana Maria Madureira
Scientific Network for Innovation and Research Excellence, Machine Intelligence Research Labs, Auburn, Washington, USA
Ajith Abraham
Polytechnic Institute of Porto, Felgueiras, Portugal
Dorabela Gamboa
Campus of Gualtar, University of Minho, Braga, Portugal
Paulo Novais


About this paper

Cite this paper

Hassan, D. (2017). Investigating the Effect of Combining Text Clustering with Classification on Improving Spam Email Detection. In: Madureira, A., Abraham, A., Gamboa, D., Novais, P. (eds) Intelligent Systems Design and Applications. ISDA 2016. Advances in Intelligent Systems and Computing, vol 557. Springer, Cham. https://doi.org/10.1007/978-3-319-53480-0_10


DOI: https://doi.org/10.1007/978-3-319-53480-0_10
Published: 23 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53479-4
Online ISBN: 978-3-319-53480-0
eBook Packages: Engineering, Engineering (R0)
