Abstract
In this paper, we describe the application of the Decision Tree Boosting (DTB) learning model to spam email filtering. This classification task implies learning in a high-dimensional feature space, so it serves as an example of how the DTB algorithm performs on such problems. In [1], it has been shown that hypotheses computed by the DTB model are more comprehensible than those computed by other ensemble methods. Hence, this paper aims to show that the DTB algorithm maintains the same comprehensibility of hypotheses in high-dimensional feature space problems while matching the performance of other ensemble methods. Four traditional evaluation measures (precision, recall, F1 and accuracy) have been considered for performance comparison between DTB and other models usually applied to spam email filtering. The hypothesis computed by DTB is smaller and more comprehensible than the hypotheses computed by AdaBoost and Naïve Bayes.
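The four evaluation measures named above can be computed from a spam filter's confusion-matrix counts. The following sketch (with made-up counts, not results from the paper) shows the standard definitions, where spam is treated as the positive class:

```python
def evaluation_measures(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts.

    tp: spam correctly flagged    fp: ham wrongly flagged
    fn: spam missed               tn: ham correctly passed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Illustrative counts only (hypothetical, not from the paper):
p, r, f1, acc = evaluation_measures(tp=90, fp=10, fn=20, tn=80)
```

With these counts, precision is 0.90 (90/100 flagged messages were spam) while recall is lower (90/110 spam messages were caught), illustrating why both measures are reported separately.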
References
Triviño-Rodriguez, J.L., Ruiz-Sepúlveda, A., Morales-Bueno, R.: How an ensemble method can compute a comprehensible model. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) 10th International Conference on Data Warehousing and Knowledge Discovery. LNCS, vol. 5182, pp. 268–378. Springer, Heidelberg (2008)
Cohen, W.W., Singer, Y.: Context-sensitive learning methods for text categorization. In: ACM Transactions on Information Systems, pp. 307–315. ACM Press, New York (1996)
Ruiz, M.E., Srinivasan, P.: Hierarchical neural networks for text categorization. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 281–282 (1999)
Larkey, L.S., Croft, W.B.: Combining classifiers in text categorization. In: SIGIR 1996: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 289–297. ACM, New York (1996)
Schapire, R.E., Singer, Y., Singhal, A.: Boosting and Rocchio applied to text filtering. In: Proceedings of ACM SIGIR, pp. 215–223. ACM Press, New York (1998)
Drucker, H., Wu, D., Vapnik, V.N.: Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10, 1048–1054 (1999)
Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C.D., Stamatopoulos, P.: Learning to filter spam e-mail: A comparison of a Naive Bayesian and a memory-based approach. CoRR cs.CL/0009009 (2000)
Apte, C., Damerau, F., Weiss, S.M.: Automated learning of decision rules for text categorization. ACM Transactions on Information Systems 12, 233–251 (1994)
SpamAssassin
Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D., Stamatopoulos, P.: A memory-based approach to anti-spam filtering (2001)
Carreras, X., Marquez, L.S., Salgado, J.G.: Boosting trees for anti-spam email filtering. In: Proceedings of RANLP 2001, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, pp. 58–64 (2001)
Guzella, T.S., Caminhas, W.M.: A review of machine learning approaches to spam filtering. Expert Systems with Applications. Corrected Proof (in press, 2009)
Quinlan, J.: Bagging, boosting, and C4.5. In: Proc. of the 13th Nat. Conf. on A.I. and the 8th Innovative Applications of A.I. Conf., pp. 725–730. AAAI/MIT Press (1996)
Tretyakov, K.: Machine learning techniques in spam filtering. Technical report, Institute of Computer Science, University of Tartu (2004)
Freund, Y., Mason, L.: The alternating decision tree learning algorithm. In: Proc. 16th International Conf. on Machine Learning, pp. 124–133. Morgan Kaufmann, San Francisco (1999)
Méndez, J.R., Fdez-Riverola, F., Díaz, F., Iglesias, E.L., Corchado, J.M.: A comparative performance study of feature selection methods for the anti-spam filtering domain. In: Perner, P. (ed.) ICDM 2006. LNCS (LNAI), vol. 4065, pp. 106–120. Springer, Heidelberg (2006)
Chen, C., Gong, Y., Bie, R., Gao, X.: Searching for interacting features for spam filtering. In: Sun, F., Zhang, J., Tan, Y., Cao, J., Yu, W. (eds.) ISNN 2008, Part I. LNCS, vol. 5263, pp. 491–500. Springer, Heidelberg (2008)
Kearns, M., Mansour, Y.: On the boosting ability of top-down decision tree learning algorithms. In: Twenty-eighth annual ACM symposium on Theory of computing, Philadelphia, Pennsylvania, United States, pp. 459–468 (1996)
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Second European Conference on Computational Learning Theory, pp. 23–37. Springer, Heidelberg (1995)
Dietterich, T., Kearns, M., Mansour, Y.: Applying the weak learning framework to understand and improve C4.5. In: Proc. 13th International Conference on Machine Learning, pp. 96–104. Morgan Kaufmann, San Francisco (1996)
Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)
Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
Witten, I., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
© 2009 Springer-Verlag Berlin Heidelberg
Cite this paper
Ruiz-Sepúlveda, A., Triviño-Rodriguez, J.L., Morales-Bueno, R. (2009). Computing a Comprehensible Model for Spam Filtering. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds) Discovery Science. DS 2009. Lecture Notes in Computer Science(), vol 5808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04747-3_39
DOI: https://doi.org/10.1007/978-3-642-04747-3_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04746-6
Online ISBN: 978-3-642-04747-3