Abstract
In the automated text classification, tfidf is often considered as the default term weighting scheme and has been widely reported in literature. However, tfidf does not directly reflect terms’ category membership. Inspired by the analysis of various feature selection methods, we propose a simple probability based term weighting scheme which directly utilizes two critical information ratios, i.e. relevance indicators. These relevance indicators are nicely supported by probability estimates which embody the category membership. Our experimental study based on two data sets, including Reuters-21578, demonstrates that the proposed probability based term weighting scheme outperforms tfidf significantly using Bayesian classifier and Support Vector Machines (SVM).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co, Boston, MA (1999)
Debole, F., Sebastiani, F.: Supervised term weighting for automated text categorization. In: Proceedings of the 2003 ACM symposium on Applied computing, ACM Press, New York (2003)
Dumais, S., Chen, H.: Hierarchical classification of Web content. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR2000), ACM Press, New York (2000)
Forman, G.: An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, Special Issue on Variable and Feature Selection 3, 1289–1305 (2003)
Joachims, T.: Text categorization with Support Vector Machines: Learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) Machine Learning: ECML-98. LNCS, vol. 1398, Springer, Heidelberg (1998)
Joachims, T.: A Statistical Learning Model of Text Classification with Support Vector Machines. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York (2001)
Leopold, E., Kindermann, J.: Text Categorization with Support Vector Machines - How to Represent Texts in Input Space. Machine Learning 46, 423–444 (2002)
Liu, Y., Loh, H.T., Tor, S.B.: Building a Document Corpus for Manufacturing Knowledge Retrieval. In: Proceedings of the Singapore MIT Alliance Symposium (2004)
Ng, H.T., Goh, W.B., Low, K.L.: Feature selection, perception learning, and a usability case study for text categorization. In: ACM SIGIR Forum. Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York (1997)
Rennie, J.D.M., Shih, L., Teevan, J., Karger, D.R.: Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In: Proceedings of the Twentieth International Conference on Machine Learning (2003)
Ruiz, M.E., Srinivasan, P.: Hierarchical Text Categorization Using Neural Networks. Information Retrieval 5, 87–118 (2002)
Salton, G., Buckley, C.: Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24, 513–523 (1988)
Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys (CSUR) 34, 1–47 (2002)
Sun, A., Lim, E.-P., Ng, W.-K., Srivastava, J.: Blocking Reduction Strategies in Hierarchical Text Classification. IEEE Transactions on Knowledge and Data Engineering (TKDE) 16, 1305–1308 (2004)
van_Rijsbergen, C.J.: Information Retrieval. 2nd edn. Butterworths, London, UK (1979)
Vapnik, V.N.: The Nature of Statistical Learning Theory, 2nd edn. Springer, New York (1999)
Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York (1999)
Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning (1997)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Liu, Y., Loh, H.T. (2007). A Simple Probability Based Term Weighting Scheme for Automated Text Classification. In: Okuno, H.G., Ali, M. (eds) New Trends in Applied Artificial Intelligence. IEA/AIE 2007. Lecture Notes in Computer Science(), vol 4570. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73325-6_4
Download citation
DOI: https://doi.org/10.1007/978-3-540-73325-6_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73322-5
Online ISBN: 978-3-540-73325-6
eBook Packages: Computer ScienceComputer Science (R0)