Tracking Concept Drift at Feature Selection Stage in SpamHunting: An Anti-spam Instance-Based Reasoning System

Méndez, J. R.; Fdez-Riverola, F.; Iglesias, E. L.; Díaz, F.; Corchado, J. M.

doi:10.1007/11805816_37

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4106)

Included in the following conference series:

European Conference on Case-Based Reasoning

Abstract

In this paper we propose a novel feature selection method able to handle concept drift in the spam filtering domain. The proposed technique is applied to SpamHunting, our earlier successful instance-based reasoning e-mail filtering system. Our information criterion is based on several ideas drawn from the well-known information measure introduced by Shannon. We show that our previous system, combined with the improved feature selection method, outperforms classical machine learning techniques and other well-known lazy learning approaches. To evaluate the performance of all the analysed models, we employ two different corpora and six well-known metrics in various scenarios.
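
The abstract does not spell out the paper's information criterion, so the following is only an illustrative sketch of the general idea of Shannon-based term scoring, not the authors' method. It uses standard information gain as a stand-in measure; the function names (`entropy`, `information_gain`, `top_terms`) and the toy messages are hypothetical. Recomputing the ranking over a window of recent messages is one possible way such a score could follow concept drift, and that too is an assumption rather than something stated here.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(messages, term):
    """Reduction in label entropy from knowing whether `term` occurs.

    `messages` is a list of (set_of_terms, label) pairs (hypothetical format).
    """
    n = len(messages)
    labels = Counter(label for _, label in messages)
    with_term = Counter(label for terms, label in messages if term in terms)
    without_term = labels - with_term  # per-label counts of messages lacking the term
    n_with = sum(with_term.values())
    n_without = n - n_with
    return (
        entropy(labels.values())
        - (n_with / n) * entropy(with_term.values())
        - (n_without / n) * entropy(without_term.values())
    )


def top_terms(messages, k):
    """Rank every term seen in `messages` by information gain and keep the best k."""
    vocabulary = set().union(*(terms for terms, _ in messages))
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(vocabulary, key=lambda t: (-information_gain(messages, t), t))
    return ranked[:k]


# Toy data (made up for illustration only).
recent_messages = [
    ({"free", "offer"}, "spam"),
    ({"free", "prize"}, "spam"),
    ({"meeting", "agenda"}, "ham"),
    ({"meeting", "free"}, "ham"),
]
selected = top_terms(recent_messages, 1)
```

In this toy set, a term that separates the two classes perfectly scores highest, while a term absent from every message scores zero. Calling `top_terms` again on a newer window of messages would yield an updated feature set.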

Author information

Authors and Affiliations

Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
J. R. Méndez, F. Fdez-Riverola & E. L. Iglesias
Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Plaza Santa Eulalia, 9-11, 40005, Segovia, Spain
F. Díaz
Dept. Informática y Automática, University of Salamanca, Plaza de la Merced s/n, 37008, Salamanca, Spain
J. M. Corchado

Editor information

Editors and Affiliations

Knowledge Management Department, German Research Center for Artificial Intelligence (DFKI) GmbH, Trippstadter Straße 122, 67663, Kaiserslautern, Germany
Thomas R. Roth-Berghofer
PricewaterhouseCoopers LLP, Center for Advanced Research, Ten Almaden Blvd, Suite 1600, 95113, San Jose, CA
Mehmet H. Göker
Department of Computer Engineering, Bilkent University, 06800, Ankara, Turkey
H. Altay Güvenir

About this paper

Cite this paper

Méndez, J.R., Fdez-Riverola, F., Iglesias, E.L., Díaz, F., Corchado, J.M. (2006). Tracking Concept Drift at Feature Selection Stage in SpamHunting: An Anti-spam Instance-Based Reasoning System. In: Roth-Berghofer, T.R., Göker, M.H., Güvenir, H.A. (eds) Advances in Case-Based Reasoning. ECCBR 2006. Lecture Notes in Computer Science, vol 4106. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11805816_37

DOI: https://doi.org/10.1007/11805816_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36843-4
Online ISBN: 978-3-540-36846-5
eBook Packages: Computer Science, Computer Science (R0)
