An efficient character recognition method using enhanced HOG for spam image detection
A spam image is an unsolicited message sent electronically to a wide group of arbitrary addresses. Because of their attractiveness and more difficult detection, spam images are the most complicated type of spam. One way to counter spam images is optical character recognition (OCR). This paper proposes an enhanced histogram of oriented gradients (HOG) feature extraction method that makes the OCR stage of spam detection both robust to variations in character scale and translation and computationally cost-effective. To this end, two steps, cropping the image and normalizing the input image size, were added to the pre-processing stage. A support vector machine (SVM) was employed for classification. Two heuristic modifications were also proposed to increase recognition accuracy: thickening thin characters in the pre-processing stage, and not discriminating between uppercase and lowercase letters with the same shape in the classification stage. In the first modification, when all pixels of the output image were empty (the character had been eliminated), the original image was made thicker by one layer. In the second, uppercase and lowercase letters with the same shape were not differentiated during recognition. The modified HOG method with the two heuristic modifications achieved an average recognition accuracy of 91.61% on the Char74K database. An optimum classification threshold was then investigated using the ROC curve. The optimal cutoff point was 0.736, which gave the highest average accuracy, 94.20%; the area under the curve (AUC) for the ROC and precision–recall (PR) curves was 0.96 and 0.73, respectively. The proposed method was also examined on the ICDAR2003 database, where the average accuracy and its optimum obtained with the ROC curve were 82.73% and 86.01%, respectively. These recognition accuracy and AUC results show a marked improvement over the best recognition rate of previous methods.
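The pre-processing steps and the two heuristics described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the nearest-neighbour resizing, the 4-neighbour dilation used for "one layer" of thickening, the 32-pixel target size, and the `SAME_SHAPE` letter set are all assumptions made for illustration, and the HOG extraction and SVM classification steps are omitted.

```python
import numpy as np

# Hypothetical set of letters whose upper- and lowercase shapes match.
# The abstract does not list them, so this choice is an assumption.
SAME_SHAPE = set("copsuvwxz")


def crop_to_character(img):
    """Crop a binary image (nonzero = ink) to the bounding box of its ink."""
    rows = np.flatnonzero(img.any(axis=1))
    cols = np.flatnonzero(img.any(axis=0))
    if rows.size == 0:
        return img  # no ink at all, nothing to crop
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def resize_nearest(img, size):
    """Normalize the image to size x size by nearest-neighbour sampling."""
    h, w = img.shape
    r = np.arange(size) * h // size
    c = np.arange(size) * w // size
    return img[np.ix_(r, c)]


def thicken(img):
    """Grow the ink by one pixel layer (4-neighbour dilation)."""
    p = np.pad(img.astype(bool), 1)
    out = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
           | p[1:-1, :-2] | p[1:-1, 2:])
    return out.astype(img.dtype)


def preprocess(img, size=32):
    """Crop and size-normalize; if the character vanishes, thicken and retry."""
    out = resize_nearest(crop_to_character(img), size)
    if not out.any():  # all output pixels empty: character was eliminated
        out = resize_nearest(crop_to_character(thicken(img)), size)
    return out


def merge_case(label):
    """Map same-shape uppercase letters onto their lowercase class."""
    return label.lower() if label.lower() in SAME_SHAPE else label
```

In this sketch, cropping removes the dependence on where the character sits in the image (translation), and resizing to a fixed grid removes the dependence on its original size (scale), so a fixed-length feature vector can be computed from the result.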
Keywords: Spam detection, OCR, Histogram of oriented gradients, Enhanced HOG, SVM, Social media, Security, ROC curve
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
This article does not contain any studies with human participants performed by any of the authors.
Informed consent was obtained from all individual participants included in the study.