Abstract
Representative words in a document are commonly identified and discriminated by the statistical distribution of their frequencies. We hypothesize that evaluating a confidence measure for terms through content-based document analysis yields better performance than relying on the parametric assumptions of the standard frequency-based method. In this paper, we propose a new term weighting approach that replaces the frequency-based probabilistic methods. Experiments with Naïve Bayesian classifiers show that our approach improves on the frequency-based method at every point of the evaluation.
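To make the contrast concrete, the following is a minimal sketch of where a term weight could enter a multinomial Naïve Bayes classifier. It is not the authors' method, which the abstract does not specify. The names `train_nb`, `classify`, and `position_weight` are hypothetical, and the position-based weight is only a stand-in for some content-based confidence measure. With the default weight of 1.0, the model reduces to the standard frequency-based baseline.

```python
import math
from collections import defaultdict


def train_nb(docs, labels, weight=None):
    """Train a multinomial Naive Bayes model from tokenized documents.

    weight is a hypothetical hook (term, position, doc) -> float.
    When omitted, every occurrence counts 1.0 (raw term frequency).
    """
    if weight is None:
        weight = lambda term, pos, doc: 1.0
    class_docs = defaultdict(int)                        # documents per class
    term_mass = defaultdict(lambda: defaultdict(float))  # weighted term mass per class
    vocab = set()
    for doc, label in zip(docs, labels):
        class_docs[label] += 1
        for pos, term in enumerate(doc):
            term_mass[label][term] += weight(term, pos, doc)
            vocab.add(term)
    return class_docs, term_mass, vocab


def classify(model, doc):
    """Return the class with the highest log posterior for doc."""
    class_docs, term_mass, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, count in class_docs.items():
        total = sum(term_mass[label].values())
        score = math.log(count / n_docs)  # log prior
        for term in doc:
            if term not in vocab:
                continue  # ignore terms never seen in training
            # Add-one (Laplace) smoothing over the weighted term mass.
            p = (term_mass[label].get(term, 0.0) + 1.0) / (total + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Illustrative weight (an assumption): the first token of a document counts double.
def position_weight(term, pos, doc):
    return 2.0 if pos == 0 else 1.0


docs = [["goal", "match", "team"], ["vote", "election", "party"]]
labels = ["sports", "politics"]

baseline = train_nb(docs, labels)                       # frequency-based
weighted = train_nb(docs, labels, weight=position_weight)
print(classify(baseline, ["team", "goal"]))
print(classify(weighted, ["team", "goal"]))
```

Keeping the weight as a function argument leaves the classifier unchanged while the weighting scheme varies, which is the kind of comparison the abstract describes.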
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lee, KC., Kang, SS., Hahn, KS. (2005). A Term Weighting Approach for Text Categorization. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_66
DOI: https://doi.org/10.1007/11562382_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science (R0)