Abstract
Concept index (CI) is a fast and efficient feature extraction (FE) algorithm for text classification. The key idea of the CI scheme is to express each document as a function of the concepts (centroids) present in the collection. However, the ability of the centroids to represent the categories of a corpus is often degraded by so-called model misfit, which can arise at several stages of the FE process, from feature selection to the choice of similarity measure. To address this issue, this work employs the “DragPushing” strategy to refine the centroids used for the concept index. We present an extensive experimental evaluation of the refined concept index (RCI) on two English collections and one Chinese corpus using a state-of-the-art Support Vector Machine (SVM) classifier. The results indicate that in each case the RCI-based SVM performs much better than the normal CI-based SVM, at a lower computation cost during both the training and classification phases.
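The following is a minimal sketch of the two ideas named in the abstract, not the authors' implementation. It assumes documents are already term-weight vectors, that the concept index of a document is its vector of cosine similarities to the class centroids, and that DragPushing moves the true-class centroid toward a misclassified training document while moving the wrongly chosen centroid away from it. The update rule, the `rate` and `epochs` parameters, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def class_centroids(docs, labels):
    """Sum the unit-length documents of each class: one centroid ("concept") per class."""
    dim = len(docs[0])
    centroids = {}
    for doc, label in zip(docs, labels):
        centroid = centroids.setdefault(label, [0.0] * dim)
        for i, x in enumerate(normalize(doc)):
            centroid[i] += x
    return centroids

def concept_index(doc, centroids):
    """Reduced representation: cosine similarity to each centroid, in sorted label order."""
    d = normalize(doc)
    return [dot(d, normalize(centroids[k])) for k in sorted(centroids)]

def nearest(doc, centroids):
    """Label of the centroid with the highest cosine similarity to the document."""
    d = normalize(doc)
    return max(sorted(centroids), key=lambda k: dot(d, normalize(centroids[k])))

def drag_push(docs, labels, centroids, rate=0.5, epochs=5):
    """Assumed refinement rule: for each misclassified training document,
    drag its true-class centroid toward it and push the predicted-class
    centroid away from it (weights are clipped at zero)."""
    refined = {k: list(v) for k, v in centroids.items()}
    for _ in range(epochs):
        for doc, label in zip(docs, labels):
            pred = nearest(doc, refined)
            if pred == label:
                continue
            for i, x in enumerate(normalize(doc)):
                refined[label][i] += rate * x                             # drag
                refined[pred][i] = max(0.0, refined[pred][i] - rate * x)  # push
    return refined
```

In this sketch, `concept_index(doc, drag_push(docs, labels, class_centroids(docs, labels)))` would give the refined low-dimensional features that a downstream classifier such as an SVM is trained on; the classifier itself is left out.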
Cite this article
Cheng, X., Tan, S. & Tang, L. Using DragPushing to Refine Concept Index for Text Categorization. J Comput Sci Technol 21, 592–596 (2006). https://doi.org/10.1007/s11390-006-0592-9
DOI: https://doi.org/10.1007/s11390-006-0592-9