A Simple Probability Based Term Weighting Scheme for Automated Text Classification

Liu, Ying; Loh, Han Tong

doi:10.1007/978-3-540-73325-6_4

Ying Liu¹ &
Han Tong Loh²

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4570))

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems

1359 Accesses

Abstract

In the automated text classification, tfidf is often considered as the default term weighting scheme and has been widely reported in literature. However, tfidf does not directly reflect terms’ category membership. Inspired by the analysis of various feature selection methods, we propose a simple probability based term weighting scheme which directly utilizes two critical information ratios, i.e. relevance indicators. These relevance indicators are nicely supported by probability estimates which embody the category membership. Our experimental study based on two data sets, including Reuters-21578, demonstrates that the proposed probability based term weighting scheme outperforms tfidf significantly using Bayesian classifier and Support Vector Machines (SVM).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co, Boston, MA (1999)
Google Scholar
Debole, F., Sebastiani, F.: Supervised term weighting for automated text categorization. In: Proceedings of the 2003 ACM symposium on Applied computing, ACM Press, New York (2003)
Google Scholar
Dumais, S., Chen, H.: Hierarchical classification of Web content. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR2000), ACM Press, New York (2000)
Google Scholar
Forman, G.: An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, Special Issue on Variable and Feature Selection 3, 1289–1305 (2003)
MATH Google Scholar
Joachims, T.: Text categorization with Support Vector Machines: Learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) Machine Learning: ECML-98. LNCS, vol. 1398, Springer, Heidelberg (1998)
Chapter Google Scholar
Joachims, T.: A Statistical Learning Model of Text Classification with Support Vector Machines. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York (2001)
Google Scholar
Leopold, E., Kindermann, J.: Text Categorization with Support Vector Machines - How to Represent Texts in Input Space. Machine Learning 46, 423–444 (2002)
Article MATH Google Scholar
Liu, Y., Loh, H.T., Tor, S.B.: Building a Document Corpus for Manufacturing Knowledge Retrieval. In: Proceedings of the Singapore MIT Alliance Symposium (2004)
Google Scholar
Ng, H.T., Goh, W.B., Low, K.L.: Feature selection, perception learning, and a usability case study for text categorization. In: ACM SIGIR Forum. Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York (1997)
Google Scholar
Rennie, J.D.M., Shih, L., Teevan, J., Karger, D.R.: Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In: Proceedings of the Twentieth International Conference on Machine Learning (2003)
Google Scholar
Ruiz, M.E., Srinivasan, P.: Hierarchical Text Categorization Using Neural Networks. Information Retrieval 5, 87–118 (2002)
Article MATH Google Scholar
Salton, G., Buckley, C.: Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24, 513–523 (1988)
Article Google Scholar
Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
Google Scholar
Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR) 34, 1–47 (2002)
Article Google Scholar
Sun, A., Lim, E.-P., Ng, W.-K., Srivastava, J.: Blocking Reduction Strategies in Hierarchical Text Classification. IEEE Transactions on Knowledge and Data Engineering (TKDE) 16, 1305–1308 (2004)
Article Google Scholar
van_Rijsbergen, C.J.: Information Retrieval. 2nd edn. Butterworths, London, UK (1979)
Google Scholar
Vapnik, V.N.: The Nature of Statistical Learning Theory, 2nd edn. Springer, New York (1999)
Google Scholar
Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
MATH Google Scholar
Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York (1999)
Google Scholar
Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning (1997)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong SAR, China
Ying Liu
Department of Mechanical Engineering, National University of Singapore, 21 Lower Kent Ridge Road,119077, Singapore
Han Tong Loh

Authors

Ying Liu
View author publications
You can also search for this author in PubMed Google Scholar
Han Tong Loh
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Hiroshi G. Okuno Moonis Ali

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Liu, Y., Loh, H.T. (2007). A Simple Probability Based Term Weighting Scheme for Automated Text Classification. In: Okuno, H.G., Ali, M. (eds) New Trends in Applied Artificial Intelligence. IEA/AIE 2007. Lecture Notes in Computer Science(), vol 4570. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73325-6_4

Download citation

DOI: https://doi.org/10.1007/978-3-540-73325-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73322-5
Online ISBN: 978-3-540-73325-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics