CWC: A Clustering-Based Feature Weighting Approach for Text Classification

  • Conference paper
Modeling Decisions for Artificial Intelligence (MDAI 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4617)

Abstract

Most existing text classification methods represent documents in the vector space model and compute the document vectors with TF-IDF weighting. However, TF-IDF ignores the fact that the weight of a feature in a document depends not only on the document but also on the class to which the document belongs. In this paper, we present a Clustering-based feature Weighting approach for text Classification, or CWC for short. CWC treats each class in the training collection as a known cluster and iteratively searches for the feature weights that optimize the clustering objective function, so that the resulting weights best distinguish documents in different classes. We validate CWC by running classification experiments on two real text collections; the results show that CWC outperforms traditional KNN.
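
The abstract does not state the CWC objective function or its update rule, so the following Python sketch is not the authors' algorithm. It only illustrates the contrast the abstract draws: TF-IDF weights are computed without looking at class labels, whereas a class-aware scheme can score each feature by how well it separates the known classes. The function names, the toy corpus, and the between-class variance ratio used as a stand-in weighting are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Class-independent TF-IDF: weight = tf * log(N / df)."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def class_separation_weights(vectors, labels):
    """Hypothetical stand-in for class-aware weighting (NOT the CWC rule):
    share of each feature's variance that is explained by the class labels."""
    dims = len(vectors[0])
    global_mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    between = [0.0] * dims
    within = [0.0] * dims
    for c in set(labels):
        members = [v for v, l in zip(vectors, labels) if l == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        for j in range(dims):
            between[j] += len(members) * (centroid[j] - global_mean[j]) ** 2
            within[j] += sum((v[j] - centroid[j]) ** 2 for v in members)
    # Features with no variance at all get weight 0.
    return [b / (b + w) if b + w > 0 else 0.0 for b, w in zip(between, within)]

def weighted_sq_distance(a, b, weights):
    """Feature-weighted squared Euclidean distance, usable inside a KNN classifier."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))

# Toy corpus (invented for this example): two classes, two documents each.
docs = [
    ["ball", "goal", "team"],
    ["ball", "team", "win"],
    ["vote", "law", "team"],
    ["vote", "law", "win"],
]
labels = ["sport", "sport", "politics", "politics"]

vocab, vectors = tfidf_vectors(docs)
weights = class_separation_weights(vectors, labels)
for term, w in zip(vocab, weights):
    print(f"{term}: {w:.3f}")
```

In this toy corpus, "ball" occurs only in the sport documents and receives weight 1, while "win" occurs once in each class and receives weight 0, although TF-IDF assigns both terms the same inverse document frequency. An iterative method such as the one the abstract describes would refine the weights against a clustering objective rather than compute them in one pass.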



Editor information

Vicenç Torra, Yasuo Narukawa, Yuji Yoshida

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhu, L., Guan, J., Zhou, S. (2007). CWC: A Clustering-Based Feature Weighting Approach for Text Classification. In: Torra, V., Narukawa, Y., Yoshida, Y. (eds.) Modeling Decisions for Artificial Intelligence. MDAI 2007. Lecture Notes in Computer Science, vol 4617. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73729-2_20

  • DOI: https://doi.org/10.1007/978-3-540-73729-2_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73728-5

  • Online ISBN: 978-3-540-73729-2

  • eBook Packages: Computer Science, Computer Science (R0)
