Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages

Güngör, Tunga; Çıltık, Ali

doi:10.1007/978-3-540-73351-5_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4592)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

In this paper, we propose methods and heuristics with high accuracy and low time complexity for filtering spam e-mails. The methods are based on the n-gram approach, and we devise a heuristic referred to as the first n-words heuristic. Although the main concern of the research is the applicability of these methods to Turkish e-mails, they were also applied to English e-mails. A data set was compiled for both languages, and extensive tests were performed with different parameters. Success rates of about 97% for Turkish e-mails and above 98% for English e-mails were obtained. In addition, it is shown that the time complexity can be reduced significantly without sacrificing the success rate.
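
To make the idea concrete, the following is a minimal sketch of how an n-gram classifier might be combined with a first n-words cutoff. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the n-gram order, the smoothing, or the decision rule, so the bigram model, the add-one smoothing, and the names `first_n_words`, `BigramModel`, and `classify` are all hypothetical choices made here.

```python
import math
from collections import Counter


def first_n_words(text, n):
    """Keep only the first n whitespace-separated tokens, lowercased."""
    return text.lower().split()[:n]


def bigrams(tokens):
    """Pair each token with its predecessor; '<s>' marks the message start."""
    padded = ["<s>"] + tokens
    return list(zip(padded, padded[1:]))


class BigramModel:
    """Add-one smoothed bigram counts for one class (assumed model choice)."""

    def __init__(self, messages, n_words):
        self.pair_counts = Counter()
        self.context_counts = Counter()
        self.vocab = {"<s>"}
        for msg in messages:
            # Training also looks only at the first n_words tokens.
            tokens = first_n_words(msg, n_words)
            self.vocab.update(tokens)
            for prev, cur in bigrams(tokens):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, tokens, vocab_size):
        """Sum of smoothed log-probabilities of the token sequence."""
        total = 0.0
        for prev, cur in bigrams(tokens):
            num = self.pair_counts[(prev, cur)] + 1
            den = self.context_counts[prev] + vocab_size
            total += math.log(num / den)
        return total


def classify(text, spam_model, legit_model, n_words):
    """Label a message by comparing class scores on its first n_words tokens."""
    tokens = first_n_words(text, n_words)
    vocab_size = len(spam_model.vocab | legit_model.vocab)
    spam_score = spam_model.log_prob(tokens, vocab_size)
    legit_score = legit_model.log_prob(tokens, vocab_size)
    return "spam" if spam_score > legit_score else "legitimate"
```

In this sketch the cost of scoring a message is bounded by the cutoff rather than by the message length, which is one plausible reading of how such a heuristic lowers time complexity. A side effect of the cutoff is that text after the first n words has no influence on the decision, so the choice of n is a parameter that would need to be tuned on real data.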

Author information

Authors and Affiliations

Boğaziçi University, Computer Engineering Department, Bebek, 34342 İstanbul, Turkey
Tunga Güngör & Ali Çıltık

Editor information

Zoubida Kedad, Nadira Lammari, Elisabeth Métais, Farid Meziane, Yacine Rezgui

About this paper

Cite this paper

Güngör, T., Çıltık, A. (2007). Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages. In: Kedad, Z., Lammari, N., Métais, E., Meziane, F., Rezgui, Y. (eds) Natural Language Processing and Information Systems. NLDB 2007. Lecture Notes in Computer Science, vol 4592. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73351-5_4

DOI: https://doi.org/10.1007/978-3-540-73351-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73350-8
Online ISBN: 978-3-540-73351-5
eBook Packages: Computer Science, Computer Science (R0)
