Abstract
The growing problem of unsolicited bulk e-mail, also known as “spam”, has created a need for reliable anti-spam e-mail filters. We introduce seven filtering algorithms: Naive Bayesian (NB), Decision Tree (DT), AdaBoost, ANN, SVM, VSM and KNN. We discuss design considerations and implementation issues for these filters, such as how to make NB, SVM, VSM and KNN cost-sensitive. Using two relatively large collections of real personal e-mail, we conducted a comprehensive comparative study of the seven filters based on a cost-sensitive measure that we adopted. The study covers the effects of feature subset size and training-corpus distribution, issues that have not been explored in previous experiments. The comparative results show that the cost-sensitive filters (NB, SVM, VSM and KNN) misclassify fewer legitimate messages when the relevant parameters, the feature subset size and the training dataset’s distribution are chosen reasonably.
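As an illustration of what making a Naive Bayesian filter cost-sensitive can mean, the sketch below uses one common approach: a message is labelled spam only when the estimated odds of spam exceed a cost ratio, so that blocking a legitimate message is treated as more costly than letting spam through. The abstract does not say how the paper's own filters are built, so everything here is an assumption made for illustration: the function names `train_nb`, `log_odds_spam` and `classify`, the `cost_ratio` parameter, the multinomial model with add-one smoothing, and the requirement that both classes appear in the training data.

```python
import math
from collections import Counter


def train_nb(docs):
    """Collect per-class token counts.

    docs: list of (tokens, label) pairs, label in {"spam", "legit"}.
    Assumes at least one training message per class (hypothetical helper).
    """
    counts = {"spam": Counter(), "legit": Counter()}
    n_docs = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)
        n_docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["legit"])
    return counts, n_docs, vocab


def log_odds_spam(tokens, model):
    """Return log P(spam | tokens) - log P(legit | tokens)."""
    counts, n_docs, vocab = model
    total_docs = sum(n_docs.values())
    score = {}
    for label in ("spam", "legit"):
        n_tokens = sum(counts[label].values())
        s = math.log(n_docs[label] / total_docs)  # class prior
        for t in tokens:
            if t in vocab:  # tokens never seen in training are ignored
                # add-one (Laplace) smoothed token likelihood
                s += math.log((counts[label][t] + 1) / (n_tokens + len(vocab)))
        score[label] = s
    return score["spam"] - score["legit"]


def classify(tokens, model, cost_ratio=9.0):
    """Label as spam only if the spam odds exceed cost_ratio.

    cost_ratio is an assumed parameter: how many times worse it is to
    block a legitimate message than to pass a spam message.
    """
    if log_odds_spam(tokens, model) > math.log(cost_ratio):
        return "spam"
    return "legit"
```

With `cost_ratio=1.0` this reduces to the ordinary Naive Bayesian decision; larger values trade missed spam for fewer misclassified legitimate messages.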
Copyright information
© 2005 International Federation for Information Processing
About this paper
Cite this paper
Li, W., Liu, C., Chen, Y. (2005). Design and Implement Cost-Sensitive Email Filtering Algorithms. In: Li, D., Wang, B. (eds) Artificial Intelligence Applications and Innovations. AIAI 2005. IFIP — The International Federation for Information Processing, vol 187. Springer, Boston, MA. https://doi.org/10.1007/0-387-29295-0_35
DOI: https://doi.org/10.1007/0-387-29295-0_35
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-28318-0
Online ISBN: 978-0-387-29295-3
eBook Packages: Computer Science, Computer Science (R0)