Machine intelligence-based algorithms for spam filtering on document labeling

  • Devottam Gaurav
  • Sanju Mishra Tiwari (Email author)
  • Ayush Goyal
  • Niketa Gandhi
  • Ajith Abraham
Methodologies and Application


The internet provides numerous modes of secure data transmission from one end station to another, and email is one of them. Its popularity stems from its cost-effectiveness and its support for fast communication. At the same time, many undesirable emails, called spam, are generated in bulk for monetary benefit. Although people can readily recognize an email as spam, doing so manually wastes time. Machine learning methods are therefore used to automate the classification task. The limited availability of email spam datasets, the resulting constrained data, and text written in an informal way are the main issues that cause current algorithms to fall short of expectations during classification. This paper proposes a novel spam mail detection method based on the document labeling concept, which classifies new emails as ham or spam. Naive Bayes, Decision Tree and Random Forest (RF) algorithms are used in the classification process, and three datasets are used to evaluate the proposed method. Experimental results show that RF achieves higher accuracy than the other methods.
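To make the ham/spam classification step concrete, the following is a minimal sketch of one of the classifiers named above, Naive Bayes, applied to a bag-of-words representation. It is not the authors' document labeling method: the class name `NaiveBayesSpamFilter`, the whitespace tokenizer, the add-one smoothing, and the four toy training messages are all assumptions made for illustration, and none of the paper's three datasets is used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        # Number of training documents per class, used for the class priors.
        self.class_counts = Counter(labels)
        # Per-class word frequencies, used for the word likelihoods.
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is the union of all words seen in training.
        self.vocabulary = set()
        for counts in self.word_counts.values():
            self.vocabulary.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                # Add the smoothed log likelihood of the word given the class.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data (not taken from the paper's datasets).
train_docs = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch with the project team",
]
train_labels = ["spam", "spam", "ham", "ham"]

spam_filter = NaiveBayesSpamFilter().fit(train_docs, train_labels)
prediction_spam = spam_filter.predict("claim your free money")
prediction_ham = spam_filter.predict("agenda for the team meeting")
```

On this toy data the first message is labeled spam and the second ham. A Decision Tree or Random Forest would consume the same kind of word-count features, so a comparison like the one reported above could reuse the tokenization and training split while swapping only the classifier.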


Keywords: Machine learning, Spam detection, Document labeling, Feature selection



This study was not funded by any grant.

Compliance with ethical standards

Conflict of Interest

The authors declare that they have no conflict of interest.

Human and animal rights

No animals were involved. This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  • Devottam Gaurav (1)
  • Sanju Mishra Tiwari (2, Email author)
  • Ayush Goyal (3)
  • Niketa Gandhi (4)
  • Ajith Abraham (5)
  1. Department of Computer Science and Engineering, Chandigarh University, Punjab, India
  2. Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
  3. Department of Electrical Engineering and Computer Science, Texas A&M University - Kingsville, Kingsville, USA
  4. University of Mumbai, Mumbai, India
  5. Machine Intelligence Research Labs (MIR Labs), Auburn, USA
