Abstract
Machine learning techniques can play an important role in filtering spam email because ample training data is available to build a robust classifier. However, spam filtering is a particularly challenging task because the data distribution and the concept being learned change over time. This is an especially awkward form of concept drift because the change is driven by spammers who want to circumvent the filters. In this paper we show that lazy learning techniques are appropriate for such dynamically changing contexts. We present ECUE, a case-based system for spam filtering that can learn dynamically. We evaluate its performance as the case-base is updated with new cases. We also explore the benefit of periodically redoing the feature selection process to bring new features into play. Our evaluation shows that these two levels of model update, adding new cases and reselecting features, are effective in tracking concept drift.
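The two levels of model update described in the abstract can be illustrated with a small sketch. This is not ECUE's implementation: the abstract does not specify the similarity measure, the feature-scoring method, or the value of k, so the whitespace tokenizer, Jaccard similarity, and document-frequency-difference scoring used here are all assumptions, and the names `CaseBaseFilter`, `add_case`, and `reselect_features` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for real email features.
    return set(text.lower().split())


def select_features(cases, n_features):
    """Keep the n tokens whose document frequency differs most between classes.

    Assumption: this simple score is a placeholder for whatever feature
    selection measure a real system would use.
    """
    spam_df, ham_df = Counter(), Counter()
    n_spam = sum(1 for _, label in cases if label == "spam")
    n_ham = len(cases) - n_spam
    for tokens, label in cases:
        (spam_df if label == "spam" else ham_df).update(tokens)

    def score(tok):
        return abs(spam_df[tok] / max(n_spam, 1) - ham_df[tok] / max(n_ham, 1))

    vocab = set(spam_df) | set(ham_df)
    # Ties are broken alphabetically so the result is deterministic.
    return set(sorted(vocab, key=lambda t: (-score(t), t))[:n_features])


class CaseBaseFilter:
    """Hypothetical k-nearest-neighbour (lazy) spam filter over a case-base."""

    def __init__(self, n_features=50, k=3):
        self.n_features = n_features
        self.k = k
        self.cases = []        # list of (token set, label) pairs
        self.features = set()  # tokens currently used for matching

    def add_case(self, text, label):
        # Level 1 update: store a newly labelled email as a case.
        self.cases.append((tokenize(text), label))

    def reselect_features(self):
        # Level 2 update: redo feature selection over the current case-base.
        self.features = select_features(self.cases, self.n_features)

    def classify(self, text):
        if not self.cases:
            raise ValueError("case-base is empty")
        query = tokenize(text) & self.features

        def similarity(case_tokens):
            case = case_tokens & self.features
            union = query | case
            return len(query & case) / len(union) if union else 0.0

        nearest = sorted(self.cases, key=lambda c: similarity(c[0]),
                         reverse=True)[:self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
```

Because the learner is lazy, adding a case requires no retraining: the new email is simply available at the next retrieval. A token that appears only in recent spam, however, influences matching only after `reselect_features()` has been called again, which is the reason for the second, less frequent level of update.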
Copyright information
© 2005 Springer-Verlag London Limited
About this paper
Cite this paper
Delany, S.J., Cunningham, P., Tsymbal, A., Coyle, L. (2005). A Case-Based Technique for Tracking Concept Drift in Spam Filtering. In: Macintosh, A., Ellis, R., Allen, T. (eds) Applications and Innovations in Intelligent Systems XII. SGAI 2004. Springer, London. https://doi.org/10.1007/1-84628-103-2_1
DOI: https://doi.org/10.1007/1-84628-103-2_1
Publisher Name: Springer, London
Print ISBN: 978-1-85233-908-1
Online ISBN: 978-1-84628-103-7
eBook Packages: Computer Science (R0)