Abstract
Most existing text classification methods represent documents with the vector space model and weight the document vectors by TF-IDF. However, TF-IDF weighting ignores the fact that a feature's weight in a document depends not only on the document itself but also on the class the document belongs to. In this paper, we present CWC, a Clustering-based feature Weighting approach for text Classification. CWC treats each class in the training collection as a known cluster and searches iteratively for the feature weights that optimize the clustering objective function, so that the best clustering result is achieved and documents in different classes are best distinguished under the resulting feature weights. We validate CWC by classification experiments on two real text collections; the results show that CWC outperforms the traditional KNN classifier.
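The paper's full algorithm is not reproduced on this page, but the idea in the abstract — fix each class as a known cluster and choose feature weights that minimize a weighted within-cluster dispersion — can be sketched minimally. The snippet below is an illustrative approximation in the style of weighted-dissimilarity clustering, not the authors' exact method: the objective (minimize the sum over features of w_j^beta times the within-class dispersion D_j, with the weights summing to 1), the exponent `beta`, and the function name are all assumptions for this sketch. Because the clusters (classes) are fixed, this objective admits a closed-form solution and needs no iteration.

```python
import numpy as np

def cwc_feature_weights(X, y, beta=2.0):
    """Illustrative sketch (not the paper's exact algorithm): treat each
    class as a fixed cluster and weight features by how tightly they
    concentrate within classes."""
    # Per-feature within-class dispersion: D_j = sum_c sum_{i in c} (x_ij - m_cj)^2,
    # where m_cj is the mean of feature j over the documents of class c.
    D = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        D += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    # Guard against zero dispersion (a feature constant within every class).
    D = np.where(D > 0, D, D[D > 0].min() if (D > 0).any() else 1.0)
    # Minimizing sum_j w_j^beta * D_j subject to sum_j w_j = 1 gives the
    # closed-form solution w_j proportional to (1 / D_j)^(1 / (beta - 1)).
    w = (1.0 / D) ** (1.0 / (beta - 1))
    return w / w.sum()
```

Features with small within-class spread (i.e., features that discriminate between classes) receive large weights, which is the effect the abstract attributes to CWC; a KNN classifier would then use these weights in its distance computation.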
© 2007 Springer-Verlag Berlin Heidelberg
Cite this paper
Zhu, L., Guan, J., Zhou, S. (2007). CWC: A Clustering-Based Feature Weighting Approach for Text Classification. In: Torra, V., Narukawa, Y., Yoshida, Y. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2007. Lecture Notes in Computer Science(), vol 4617. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73729-2_20
Print ISBN: 978-3-540-73728-5
Online ISBN: 978-3-540-73729-2