Abstract
Most existing text classification methods represent documents with the vector space model and weight the document vectors by TF-IDF. However, TF-IDF weighting ignores the fact that a feature's weight in a document depends not only on the document itself but also on the class the document belongs to. In this paper, we present CWC, a Clustering-based feature Weighting approach for text Classification. CWC treats each class in the training collection as a known cluster and searches iteratively for the feature weights that optimize the clustering objective function, so that the best clustering result is achieved and documents in different classes are best distinguished under the resulting feature weights. We validate CWC by classification experiments on two real text collections; the results show that CWC outperforms the traditional KNN classifier.
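The paper's full algorithm is not reproduced on this page, but the idea in the abstract — fix each class as a known cluster and choose feature weights that minimize a weighted within-cluster dispersion — can be sketched minimally. The snippet below is an illustrative approximation in the style of weighted-dissimilarity clustering, not the authors' exact method: the objective (minimize the sum over features of w_j^beta times the within-class dispersion D_j, with the weights summing to 1), the exponent `beta`, and the function name are all assumptions for this sketch. Because the clusters (classes) are fixed, this objective admits a closed-form solution and needs no iteration.

```python
import numpy as np

def cwc_feature_weights(X, y, beta=2.0):
    """Illustrative sketch (not the paper's exact algorithm): treat each
    class as a fixed cluster and weight features by how tightly they
    concentrate within classes."""
    # Per-feature within-class dispersion: D_j = sum_c sum_{i in c} (x_ij - m_cj)^2,
    # where m_cj is the mean of feature j over the documents of class c.
    D = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        D += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    # Guard against zero dispersion (a feature constant within every class).
    D = np.where(D > 0, D, D[D > 0].min() if (D > 0).any() else 1.0)
    # Minimizing sum_j w_j^beta * D_j subject to sum_j w_j = 1 gives the
    # closed-form solution w_j proportional to (1 / D_j)^(1 / (beta - 1)).
    w = (1.0 / D) ** (1.0 / (beta - 1))
    return w / w.sum()
```

Features with small within-class spread (i.e., features that discriminate between classes) receive large weights, which is the effect the abstract attributes to CWC; a KNN classifier would then use these weights in its distance computation.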
© 2007 Springer-Verlag Berlin Heidelberg
Cite this paper
Zhu, L., Guan, J., Zhou, S. (2007). CWC: A Clustering-Based Feature Weighting Approach for Text Classification. In: Torra, V., Narukawa, Y., Yoshida, Y. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2007. Lecture Notes in Computer Science(), vol 4617. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73729-2_20
Print ISBN: 978-3-540-73728-5
Online ISBN: 978-3-540-73729-2