Using DragPushing to Refine Concept Index for Text Categorization

  • Semantic & Contents Computing
Journal of Computer Science and Technology

Abstract

Concept index (CI) is a fast and efficient feature extraction (FE) algorithm for text classification. The key idea of the CI scheme is to express each document as a function of the various concepts (centroids) present in the collection. However, the ability of the centroids to represent the corpus being categorized is often degraded by so-called model misfit, which is caused by a number of factors in the FE process, ranging from feature selection to the similarity measure. To address this issue, this work employs the “DragPushing” strategy to refine the centroids used for the concept index. We present an extensive experimental evaluation of the refined concept index (RCI) on two English collections and one Chinese corpus using a state-of-the-art Support Vector Machine (SVM) classifier. The results indicate that in each case, the RCI-based SVM yields much better performance than the normal CI-based SVM, with lower computation cost during the training and classification phases.
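
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a document is expressed "as a function of" the centroids through its cosine similarity to each class centroid, and that DragPushing drags the true-class centroid toward each misclassified training document while pushing the wrongly predicted centroid away from it. The function names, the `rate` and `epochs` parameters, and the exact update rule are hypothetical choices made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def class_centroids(docs, labels):
    """Average the unit-length document vectors of each class."""
    sums, counts = {}, {}
    for d, y in zip(docs, labels):
        d = normalize(d)
        if y not in sums:
            sums[y] = [0.0] * len(d)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], d)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def concept_index(doc, centroids, order):
    """Represent a document by its cosine similarity to each centroid."""
    d = normalize(doc)
    return [dot(d, normalize(centroids[y])) for y in order]

def drag_push(docs, labels, centroids, rate=0.5, epochs=5):
    """Assumed refinement rule: on each training error, drag the true
    centroid toward the document and push the predicted centroid away."""
    cents = {y: list(c) for y, c in centroids.items()}
    order = sorted(cents)
    for _ in range(epochs):
        for d, y in zip(docs, labels):
            d = normalize(d)
            sims = concept_index(d, cents, order)
            pred = order[sims.index(max(sims))]
            if pred != y:
                cents[y] = [c + rate * x for c, x in zip(cents[y], d)]
                cents[pred] = [c - rate * x for c, x in zip(cents[pred], d)]
    return cents

# Toy term-frequency vectors over a three-word vocabulary (made-up data).
docs = [[3, 0, 1], [2, 1, 0], [0, 3, 1], [1, 2, 0]]
labels = ["sports", "sports", "politics", "politics"]

centroids = class_centroids(docs, labels)
refined = drag_push(docs, labels, centroids)
order = sorted(refined)  # fixed column order for the reduced features
features = [concept_index(d, refined, order) for d in docs]
```

Each row of `features` has one entry per class, so the reduced representation has as many dimensions as there are categories. These rows would then be passed to the classifier, which is an SVM in the evaluation described above.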

Author information

Corresponding author

Correspondence to Xueqi Cheng.

About this article

Cite this article

Cheng, X., Tan, S. & Tang, L. Using DragPushing to Refine Concept Index for Text Categorization. J Comput Sci Technol 21, 592–596 (2006). https://doi.org/10.1007/s11390-006-0592-9

  • DOI: https://doi.org/10.1007/s11390-006-0592-9
