Using DragPushing to Refine Concept Index for Text Categorization

Cheng, Xueqi; Tan, Songbo; Tang, Lilian

doi:10.1007/s11390-006-0592-9

Semantic & Contents Computing
Published: July 2006

Volume 21, pages 592–596, (2006)
Journal of Computer Science and Technology

Xueqi Cheng¹,
Songbo Tan¹ &
Lilian Tang²

Abstract

Concept index (CI) is a fast and efficient feature extraction (FE) algorithm for text classification. The key idea of the CI scheme is to express each document as a function of the concepts (centroids) present in the collection. However, the ability of the centroids to represent the categories of a corpus is often degraded by so-called model misfit, which can be introduced at many points in the FE process, from feature selection to the similarity measure. To address this issue, this work employs the “DragPushing” strategy to refine the centroids used for the concept index. We present an extensive experimental evaluation of the refined concept index (RCI) on two English collections and one Chinese corpus using a state-of-the-art Support Vector Machine (SVM) classifier. The results indicate that in each case the RCI-based SVM performs much better than the normal CI-based SVM, with lower computation cost during the training and classification phases.
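
The abstract gives no formulas, so the following is only a minimal sketch of the idea under stated assumptions. It assumes that each concept is the normalized centroid of one class, that a document's CI features are its cosine similarities to those centroids, and that the refinement drags the true-class centroid toward a misclassified training document while pushing the wrongly predicted centroid away. The function names, the `rate` and `epochs` parameters, and the toy data are all hypothetical, and the update rule may differ from the one used in the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def class_centroids(docs, labels):
    """Assumed concept definition: normalized sum of each class's unit-length documents."""
    sums = {}
    for d, y in zip(docs, labels):
        u = normalize(d)
        acc = sums.setdefault(y, [0.0] * len(u))
        for i, x in enumerate(u):
            acc[i] += x
    return {y: normalize(s) for y, s in sums.items()}

def concept_index(doc, centroids):
    """Map a document to one feature per concept: its cosine similarity to each centroid."""
    u = normalize(doc)
    return [dot(u, centroids[y]) for y in sorted(centroids)]

def drag_push(docs, labels, centroids, rate=0.5, epochs=5):
    """Assumed refinement rule: on each training error, drag the true-class
    centroid toward the document and push the predicted centroid away from it."""
    cents = {y: list(c) for y, c in centroids.items()}
    for _ in range(epochs):
        for d, y in zip(docs, labels):
            u = normalize(d)
            pred = max(sorted(cents), key=lambda c: dot(u, cents[c]))
            if pred != y:
                cents[y] = normalize([c + rate * x for c, x in zip(cents[y], u)])
                cents[pred] = normalize([c - rate * x for c, x in zip(cents[pred], u)])
    return cents

# Toy term-frequency vectors over a three-word vocabulary (made-up data).
docs = [[3, 0, 1], [2, 1, 0], [0, 3, 1], [0, 2, 2]]
labels = ["a", "a", "b", "b"]

cents = class_centroids(docs, labels)
refined = drag_push(docs, labels, cents)
features = concept_index(docs[0], refined)  # two features instead of three terms
```

In this sketch the reduced vectors returned by `concept_index` are what would be passed to a downstream classifier such as an SVM; that step is omitted here.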

Author information

Authors and Affiliations

Division of Intelligent Software Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100080, P.R. China
Xueqi Cheng & Songbo Tan
Department of Computing, University of Surrey, U.K.
Lilian Tang

Corresponding author

Correspondence to Xueqi Cheng.

About this article

Cite this article

Cheng, X., Tan, S. & Tang, L. Using DragPushing to Refine Concept Index for Text Categorization. J Comput Sci Technol 21, 592–596 (2006). https://doi.org/10.1007/s11390-006-0592-9

Received: 30 May 2006
Issue Date: July 2006
DOI: https://doi.org/10.1007/s11390-006-0592-9
