Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4592)

Abstract

In this paper, we propose methods and heuristics with high accuracy and low time complexity for filtering spam e-mails. The methods are based on the n-gram approach, and a heuristic, referred to as the first n-words heuristic, is devised. Although the main concern of the research is the applicability of these methods to Turkish e-mails, they were also applied to English e-mails. A data set was compiled for both languages, and extensive tests were performed with different parameters. Success rates of about 97% for Turkish e-mails and above 98% for English e-mails were obtained. In addition, it is shown that the time complexity can be reduced significantly without sacrificing the success rate.
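
The abstract does not spell out the method, so the following is only an illustrative sketch and not the authors' implementation. It assumes that the first n-words heuristic means scoring a message on its first n words only, and it substitutes a simple add-one smoothed unigram model for the paper's n-gram methods. All function names and the toy messages are hypothetical.

```python
import math
from collections import Counter


def first_n_words(text, n):
    """Assumed reading of the heuristic: keep only the first n tokens."""
    return text.lower().split()[:n]


def train_unigram(messages, n):
    """Count tokens over the first n words of each training message."""
    counts = Counter()
    for msg in messages:
        counts.update(first_n_words(msg, n))
    return counts


def log_prob(tokens, counts, vocab_size):
    """Add-one smoothed unigram log-probability of a token sequence."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)


def classify(text, spam_counts, ham_counts, n):
    """Label a message by comparing its score under the two class models."""
    tokens = first_n_words(text, n)
    vocab_size = len(set(spam_counts) | set(ham_counts)) or 1
    spam_score = log_prob(tokens, spam_counts, vocab_size)
    ham_score = log_prob(tokens, ham_counts, vocab_size)
    return "spam" if spam_score > ham_score else "legitimate"


# Toy data, invented purely for illustration.
spam = ["win free money now", "free prize claim now"]
ham = ["meeting agenda for monday", "please review the attached report"]
n = 3
spam_counts = train_unigram(spam, n)
ham_counts = train_unigram(ham, n)
print(classify("free money inside", spam_counts, ham_counts, n))
print(classify("meeting agenda attached", spam_counts, ham_counts, n))
```

Under this reading, both training and classification touch at most n tokens per message, so the work per message is bounded by n and not by the message length. That is one plausible way a heuristic of this kind could lower time complexity.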


Editor information

Zoubida Kedad, Nadira Lammari, Elisabeth Métais, Farid Meziane, Yacine Rezgui

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Güngör, T., Çıltık, A. (2007). Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages. In: Kedad, Z., Lammari, N., Métais, E., Meziane, F., Rezgui, Y. (eds) Natural Language Processing and Information Systems. NLDB 2007. Lecture Notes in Computer Science, vol 4592. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73351-5_4

  • DOI: https://doi.org/10.1007/978-3-540-73351-5_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73350-8

  • Online ISBN: 978-3-540-73351-5

  • eBook Packages: Computer Science, Computer Science (R0)
