Abstract
In this paper, we describe the application of the Decision Tree Boosting (DTB) learning model to spam email filtering. This classification task implies learning in a high-dimensional feature space, so it serves as an example of how the DTB algorithm performs on such problems. In [1], it has been shown that hypotheses computed by the DTB model are more comprehensible than those computed by other ensemble methods. Hence, this paper aims to show that the DTB algorithm maintains the same comprehensibility of hypotheses in high-dimensional feature space problems while matching the performance of other ensemble methods. Four traditional evaluation measures (precision, recall, F1 and accuracy) have been considered for performance comparison between DTB and other models usually applied to spam email filtering. The hypothesis computed by DTB is smaller and more comprehensible than the hypotheses computed by AdaBoost and Naïve Bayes.
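The four evaluation measures named above can be computed from a spam filter's confusion-matrix counts. The following sketch (with made-up counts, not results from the paper) shows the standard definitions, where spam is treated as the positive class:

```python
def evaluation_measures(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts.

    tp: spam correctly flagged    fp: ham wrongly flagged
    fn: spam missed               tn: ham correctly passed
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Illustrative counts only (hypothetical, not from the paper):
p, r, f1, acc = evaluation_measures(tp=90, fp=10, fn=20, tn=80)
```

With these counts, precision is 0.90 (90/100 flagged messages were spam) while recall is lower (90/110 spam messages were caught), illustrating why both measures are reported separately.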
References
Triviño-Rodriguez, J.L., Ruiz-Sepúlveda, A., Morales-Bueno, R.: How an ensemble method can compute a comprehensible model. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) 10th International Conference on Data Warehousing and Knowledge Discovery. LNCS, vol. 5182, pp. 268–378. Springer, Heidelberg (2008)
Cohen, W.W., Singer, Y.: Context-sensitive learning methods for text categorization. In: ACM Transactions on Information Systems, pp. 307–315. ACM Press, New York (1996)
Ruiz, M.E., Srinivasan, P.: Hierarchical neural networks for text categorization. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 281–282 (1999)
Larkey, L.S., Croft, W.B.: Combining classifiers in text categorization. In: SIGIR 1996: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 289–297. ACM, New York (1996)
Schapire, R.E., Singer, Y., Singhal, A.: Boosting and Rocchio applied to text filtering. In: Proceedings of ACM SIGIR, pp. 215–223. ACM Press, New York (1998)
Drucker, H., Wu, D., Vapnik, V.N.: Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10, 1048–1054 (1999)
Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C.D., Stamatopoulos, P.: Learning to filter spam e-mail: A comparison of a Naive Bayesian and a memory-based approach. CoRR cs.CL/0009009 (2000)
Apte, C., Damerau, F., Weiss, S.M.: Automated learning of decision rules for text categorization. ACM Transactions on Information Systems 12, 233–251 (1994)
SpamAssassin
Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D., Stamatopoulos, P.: A memory-based approach to anti-spam filtering (2001)
Carreras, X., Marquez, L.S., Salgado, J.G.: Boosting trees for anti-spam email filtering. In: Proceedings of RANLP 2001, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, pp. 58–64 (2001)
Guzella, T.S., Caminhas, W.M.: A review of machine learning approaches to spam filtering. Expert Systems with Applications. Corrected Proof (in press, 2009)
Quinlan, J.: Bagging, boosting, and C4.5. In: Proc. of the 13th Nat. Conf. on A.I. and the 8th Innovative Applications of A.I. Conf., pp. 725–730. AAAI/MIT Press (1996)
Tretyakov, K.: Machine learning techniques in spam filtering. Technical report, Institute of Computer Science, University of Tartu (2004)
Freund, Y., Mason, L.: The alternating decision tree learning algorithm. In: Proc. 16th International Conf. on Machine Learning, pp. 124–133. Morgan Kaufmann, San Francisco (1999)
Méndez, J.R., Fdez-Riverola, F., Díaz, F., Iglesias, E.L., Corchado, J.M.: A comparative performance study of feature selection methods for the anti-spam filtering domain. In: Perner, P. (ed.) ICDM 2006. LNCS (LNAI), vol. 4065, pp. 106–120. Springer, Heidelberg (2006)
Chen, C., Gong, Y., Bie, R., Gao, X.: Searching for interacting features for spam filtering. In: Sun, F., Zhang, J., Tan, Y., Cao, J., Yu, W. (eds.) ISNN 2008, Part I. LNCS, vol. 5263, pp. 491–500. Springer, Heidelberg (2008)
Kearns, M., Mansour, Y.: On the boosting ability of top-down decision tree learning algorithms. In: Twenty-eighth annual ACM symposium on Theory of computing, Philadelphia, Pennsylvania, United States, pp. 459–468 (1996)
Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Second European Conference on Computational Learning Theory, pp. 23–37. Springer, Heidelberg (1995)
Dietterich, T., Kearns, M., Mansour, Y.: Applying the weak learning framework to understand and improve C4.5. In: Proc. 13th International Conference on Machine Learning, pp. 96–104. Morgan Kaufmann, San Francisco (1996)
Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)
Quinlan, J.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
Witten, I., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
© 2009 Springer-Verlag Berlin Heidelberg
Cite this paper
Ruiz-Sepúlveda, A., Triviño-Rodriguez, J.L., Morales-Bueno, R. (2009). Computing a Comprehensible Model for Spam Filtering. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds) Discovery Science. DS 2009. Lecture Notes in Computer Science(), vol 5808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04747-3_39
DOI: https://doi.org/10.1007/978-3-642-04747-3_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04746-6
Online ISBN: 978-3-642-04747-3