Combining Naive Bayes and n-Gram Language Models for Text Classification

Peng, Fuchun; Schuurmans, Dale

doi:10.1007/3-540-36618-0_24

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2633)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

We augment the naive Bayes model with an n-gram language model to address two shortcomings of naive Bayes text classifiers. The resulting chain augmented naive Bayes classifiers have two advantages over standard naive Bayes classifiers. First, a chain augmented naive Bayes model relaxes some of the independence assumptions of naive Bayes by allowing a local Markov chain dependence among the observed variables, while still permitting efficient inference and learning. Second, smoothing techniques from statistical language modeling can be used to recover better estimates than the Laplace smoothing usually used in naive Bayes classification. Our experimental results on three real-world data sets show substantial improvements over standard naive Bayes classification, with state-of-the-art performance that competes with the best known methods in these cases.
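As a rough illustration of the idea in the abstract, the sketch below trains one bigram language model per class and assigns a document to the class with the highest prior-weighted likelihood. It is a minimal sketch and not the authors' implementation: the class names `BigramClassModel` and `ChainAugmentedNB`, the interpolation weight `lam`, the `<s>` start symbol, and the choice of linear interpolation with an add-one unigram backoff are all assumptions made for this example. The abstract only says that smoothing techniques from language modeling are used, without naming one here.

```python
import math
from collections import Counter, defaultdict


class BigramClassModel:
    """Bigram language model for a single class.

    Smoothing is linear interpolation of the bigram estimate with an
    add-one unigram estimate (an assumed choice for this sketch).
    """

    def __init__(self, lam=0.7):
        self.lam = lam                # weight on the bigram estimate
        self.unigrams = Counter()     # count of each token
        self.bigrams = Counter()      # count of each (previous, token) pair
        self.context = Counter()      # count of each token used as a context
        self.total = 0                # total number of tokens seen

    def train(self, docs):
        for tokens in docs:
            prev = "<s>"              # hypothetical start-of-document symbol
            for tok in tokens:
                self.unigrams[tok] += 1
                self.bigrams[(prev, tok)] += 1
                self.context[prev] += 1
                self.total += 1
                prev = tok

    def log_prob(self, tokens, vocab_size):
        """Log-likelihood of a token sequence under this class model."""
        lp = 0.0
        prev = "<s>"
        for tok in tokens:
            # Add-one unigram estimate, so unseen words never get zero mass.
            p_uni = (self.unigrams[tok] + 1) / (self.total + vocab_size)
            ctx = self.context[prev]
            p_bi = self.bigrams[(prev, tok)] / ctx if ctx else 0.0
            lp += math.log(self.lam * p_bi + (1 - self.lam) * p_uni)
            prev = tok
        return lp


class ChainAugmentedNB:
    """Classifier that scores a document with one bigram model per class."""

    def __init__(self, lam=0.7):
        self.lam = lam
        self.models = {}
        self.log_prior = {}
        self.vocab = set()

    def fit(self, labeled_docs):
        by_class = defaultdict(list)
        for tokens, label in labeled_docs:
            by_class[label].append(tokens)
            self.vocab.update(tokens)
        n_docs = len(labeled_docs)
        for label, docs in by_class.items():
            model = BigramClassModel(self.lam)
            model.train(docs)
            self.models[label] = model
            self.log_prior[label] = math.log(len(docs) / n_docs)
        return self

    def predict(self, tokens):
        vocab_size = len(self.vocab) + 1   # one extra slot for unknown words
        return max(
            self.models,
            key=lambda c: self.log_prior[c]
            + self.models[c].log_prob(tokens, vocab_size),
        )


if __name__ == "__main__":
    train = [
        (["the", "team", "won", "the", "match"], "sports"),
        (["the", "team", "lost"], "sports"),
        (["the", "bank", "raised", "rates"], "finance"),
        (["the", "market", "fell"], "finance"),
    ]
    clf = ChainAugmentedNB().fit(train)
    print(clf.predict(["the", "team", "won"]))
```

Setting `lam=0` reduces each class model to a smoothed unigram model, which corresponds to a standard multinomial naive Bayes classifier; raising `lam` lets each word depend on its predecessor, which is the Markov chain dependence the abstract describes.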

Author information

Authors and Affiliations

School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada, N2L 3G1
Fuchun Peng & Dale Schuurmans

Editor information

Editors and Affiliations

Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi, 1, 56124, Pisa, Italy
Fabrizio Sebastiani

Cite this paper

Peng, F., Schuurmans, D. (2003). Combining Naive Bayes and n-Gram Language Models for Text Classification. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_24

DOI: https://doi.org/10.1007/3-540-36618-0_24
Published: 15 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-01274-0
Online ISBN: 978-3-540-36618-8
eBook Packages: Springer Book Archive
