Combining Naive Bayes and n-Gram Language Models for Text Classification

  • Conference paper
Advances in Information Retrieval (ECIR 2003)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2633))

Abstract

We augment the naive Bayes model with an n-gram language model to address two shortcomings of naive Bayes text classifiers. The chain augmented naive Bayes classifiers we propose have two advantages over standard naive Bayes classifiers. First, a chain augmented naive Bayes model relaxes some of the independence assumptions of naive Bayes, allowing a local Markov chain dependence in the observed variables, while still permitting efficient inference and learning. Second, smoothing techniques from statistical language modeling can be used to recover better estimates than the Laplace smoothing techniques usually used in naive Bayes classification. Our experimental results on three real-world data sets show that we achieve substantial improvements over standard naive Bayes classification, while also achieving state-of-the-art performance that competes with the best known methods in these cases.
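
To make the idea in the abstract concrete, the following is a minimal sketch of a classifier that trains one n-gram language model per class and picks the class with the highest prior-plus-likelihood score. It is an illustration under stated assumptions, not the authors' implementation: the bigram order, the linear interpolation with an add-one unigram fallback, the weight `lam=0.7`, and the names `BigramClassModel` and `ChainAugmentedNB` are all hypothetical choices made here, and the smoothing shown is a simple stand-in rather than the methods evaluated in the paper.

```python
import math
from collections import Counter


class BigramClassModel:
    """Bigram language model for a single class (illustrative sketch)."""

    def __init__(self, lam=0.7):
        self.lam = lam                    # assumed interpolation weight on the bigram estimate
        self.unigrams = Counter()         # count of each word
        self.bigrams = Counter()          # count of each (previous word, word) pair
        self.context_totals = Counter()   # how often each word appears as a context

    def update(self, tokens):
        padded = ["<s>"] + list(tokens)   # "<s>" marks the start of a document
        for prev, cur in zip(padded, padded[1:]):
            self.unigrams[cur] += 1
            self.bigrams[(prev, cur)] += 1
            self.context_totals[prev] += 1

    def log_prob(self, tokens, vocab_size):
        total = sum(self.unigrams.values())
        padded = ["<s>"] + list(tokens)
        log_p = 0.0
        for prev, cur in zip(padded, padded[1:]):
            # Add-one unigram estimate: never zero, even for unseen words.
            p_uni = (self.unigrams[cur] + 1) / (total + vocab_size)
            ctx = self.context_totals[prev]
            if ctx:
                # Markov chain dependence: condition on the previous word,
                # interpolated with the unigram estimate.
                p_bi = self.bigrams[(prev, cur)] / ctx
                p = self.lam * p_bi + (1 - self.lam) * p_uni
            else:
                p = p_uni                 # unseen context: fall back to the unigram
            log_p += math.log(p)
        return log_p


class ChainAugmentedNB:
    """Scores each class as log prior + log likelihood under its bigram model."""

    def __init__(self, lam=0.7):
        self.lam = lam
        self.models = {}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            if label not in self.models:
                self.models[label] = BigramClassModel(self.lam)
            self.models[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen words
        scores = {
            label: math.log(self.doc_counts[label] / n_docs)
            + model.log_prob(tokens, vocab_size)
            for label, model in self.models.items()
        }
        return max(scores, key=scores.get)


# Toy usage with made-up data.
train_docs = [
    ["cheap", "pills", "buy", "now"],
    ["buy", "cheap", "pills"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting", "tomorrow"],
]
train_labels = ["spam", "spam", "ham", "ham"]
clf = ChainAugmentedNB().fit(train_docs, train_labels)
print(clf.predict(["cheap", "pills"]))
```

Setting `lam=0` in this sketch removes the dependence on the previous word, so each class model reduces to a unigram model with add-one smoothing, which corresponds to the standard multinomial naive Bayes baseline with Laplace smoothing.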

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Peng, F., Schuurmans, D. (2003). Combining Naive Bayes and n-Gram Language Models for Text Classification. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_24

  • DOI: https://doi.org/10.1007/3-540-36618-0_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-01274-0

  • Online ISBN: 978-3-540-36618-8

  • eBook Packages: Springer Book Archive
