Applied Intelligence, Volume 48, Issue 10, pp 3538–3556

Spam filtering using integrated distribution-based balancing approach and regularized deep neural networks

  • Aliaksandr Barushka
  • Petr Hajek

Abstract

Rapid growth in the volume of unsolicited and unwanted messages has inspired the development of many anti-spam methods. Supervised anti-spam filters using machine-learning methods have been particularly effective in categorizing spam and non-spam messages. These filters automatically integrate pre-processing of spam corpora, selection of appropriate word lists, and calculation of word weights, usually in a bag-of-words fashion. Developing an accurate spam filter is challenging because spammers attempt to decrease the probability of spam detection by using legitimate words. Complex models are therefore needed to solve such a problem. However, existing spam filtering methods usually converge to a poor local minimum, cannot effectively handle high-dimensional data, and suffer from overfitting. To overcome these problems, we propose a novel spam filter integrating N-gram tf.idf feature selection, a modified distribution-based balancing algorithm, and a regularized deep multi-layer perceptron NN model with rectified linear units (DBB-RDNN-ReL). As demonstrated on four benchmark spam datasets (Enron, SpamAssassin, SMS spam collection and Social networking), the additional layers of neurons enable the proposed approach to capture more complex features from high-dimensional data. Another advantage of this approach is that no additional dimensionality reduction is necessary, and spam dataset imbalance is addressed using a modified distribution-based algorithm. We compare the performance of the approach with that of state-of-the-art spam filters (Minimum Description Length, Factorial Design using SVM and NB, Incremental Learning C4.5, Random Forest, Voting and Convolutional Neural Network) and several machine learning algorithms commonly used to classify text. We show that the proposed model outperforms these other methods in terms of classification accuracy, with fewer false negatives and false positives.
Notably, the proposed spam filter classifies both the majority (legitimate) and minority (spam) classes well on personalized / non-personalized and balanced / imbalanced spam datasets. In addition, the proposed model achieves higher accuracy than the results reported in previous studies. However, the high computational cost of the additional hidden layers limits its application as an online spam filter and makes it difficult to overcome the problem of concept drift.
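
To make the N-gram tf.idf weighting mentioned above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not give the exact weighting formula, so the sketch assumes one common variant (relative term frequency multiplied by log(N / df)), word-level n-grams up to length 2, and whitespace tokenization. The function names `word_ngrams` and `tfidf_vectors` and the three sample messages are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a lowercased message."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(messages, n_max=2):
    """Map each message to a dict {n-gram: tf * idf}.

    tf  = occurrences of the n-gram in the message / n-grams in the message
    idf = log(N / df), where df is the number of messages containing the n-gram
    (an assumed variant; the paper's exact weighting may differ).
    """
    counts = [Counter(word_ngrams(m, n_max)) for m in messages]
    n_docs = len(messages)

    # Document frequency: in how many messages each n-gram appears.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {g: (k / total) * math.log(n_docs / df[g]) for g, k in c.items()}
        )
    return vectors


# Hypothetical toy corpus: two spam-like messages and one legitimate one.
messages = [
    "win a free prize now",
    "free prize inside",
    "meeting moved to monday",
]
vectors = tfidf_vectors(messages)
```

In this sketch an n-gram shared by several messages (such as "free prize") receives a lower weight than one that occurs in a single message, which is the property a tf.idf-based feature selection relies on when ranking candidate features.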

Keywords

Spam filter; Email; SMS; Social network; Deep neural network; Regularization; Imbalanced data

Notes

Acknowledgements

We gratefully acknowledge the help provided by constructive comments of the anonymous referees.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Institute of System Engineering and Informatics, Faculty of Economics and Administration, University of Pardubice, Pardubice, Czech Republic
