
CenKNN: a scalable and effective text classifier

Published in Data Mining and Knowledge Discovery

Abstract

A major challenge in text classification is performing classification on a large-scale, high-dimensional text corpus in the presence of imbalanced class distributions and a large number of irrelevant or noisy term features. A number of techniques have been proposed to handle this challenge with varying degrees of success. In this paper, by combining the strengths of two widely used text classification techniques, K-Nearest-Neighbor (KNN) and centroid-based (Centroid) classifiers, we propose a scalable and effective flat classifier, called CenKNN, to cope with this challenge. CenKNN projects high-dimensional documents (often hundreds of thousands of dimensions) into a low-dimensional space (normally a few dozen dimensions) spanned by class centroids, and then uses the \(k\)-d tree structure to find the \(K\) nearest neighbors efficiently. Owing to the strong representation power of class centroids, CenKNN overcomes two issues of existing KNN text classifiers: sensitivity to imbalanced class distributions and to irrelevant or noisy term features. By working on the projected low-dimensional data, CenKNN substantially reduces the expensive computation time of KNN. CenKNN also works better than Centroid, since it uses all the class centroids to define similarity and performs well on complex data, i.e., non-linearly separable data and data with local patterns within each class. A series of experiments on English and Chinese corpora, both benchmark and synthetic, demonstrates that although CenKNN works in a significantly lower-dimensional space, it performs substantially better than KNN and its five variants, as well as existing scalable classifiers, including Centroid and Rocchio. CenKNN is also empirically preferable to another well-known classifier, support vector machines, on highly imbalanced corpora with a small number of classes.
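The project-then-search pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes TF-IDF-style row vectors, cosine similarity to each class centroid as the projection, and simple majority voting among the \(K\) neighbors; all function names are invented for illustration.

```python
# Sketch of the CenKNN idea: project documents onto similarities to class
# centroids (one dimension per class), then run KNN in that low-dimensional
# space via a k-d tree. Term weighting and tie-breaking are assumptions.
import numpy as np
from collections import Counter
from scipy.spatial import cKDTree


def l2_normalize(X):
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def fit_cenknn(X_train, y_train):
    classes = np.unique(y_train)
    # One centroid per class, normalized so dot products are cosine similarities.
    centroids = l2_normalize(
        np.vstack([X_train[y_train == c].mean(axis=0) for c in classes]))
    # Project every training document into the n_classes-dimensional space.
    Z_train = l2_normalize(X_train) @ centroids.T
    return centroids, cKDTree(Z_train), y_train


def predict_cenknn(model, X_test, k=3):
    centroids, tree, y_train = model
    Z_test = l2_normalize(X_test) @ centroids.T
    _, idx = tree.query(Z_test, k=k)          # K nearest neighbors per query
    idx = np.atleast_2d(idx)
    # Majority vote among the neighbors' labels.
    return np.array([Counter(y_train[row]).most_common(1)[0][0] for row in idx])
```

Because the projected space has only as many dimensions as there are classes, the k-d tree query remains efficient even when the original vocabulary has hundreds of thousands of terms.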

[Figures 1–12 appear in the full article.]


Notes

  1. In this paper, classification accuracy refers to performance in terms of the micro-averaged \(F_{1}\) (denoted \(microF_{1}\)) and macro-averaged \(F_{1}\) (denoted \(macroF_{1}\)) values, rather than the ratio of correctly classified test documents to the total number of test documents.

  2. By “flat” classifiers, we refer to classifiers for text classification tasks that do not consider a class hierarchy. Flat classifiers are the building blocks of successful hierarchical text classifiers.

  3. Our experiments showed that another variant we proposed in Pang and Jiang (2013) performed very similarly to INNTC (Jiang et al. 2012), so we compared CenKNN only with INNTC rather than with both.

  4. Reuters-21578 is available at http://archive.ics.uci.edu/ml/databases/reuters21578/.

  5. 20Newsgroup is available at http://qwone.com/~jason/20Newsgroups/.

  6. TanCorp is available at http://www.searchforum.org.cn/tansongbo/corpus.htm.

  7. Fudan University text classification corpus is available at http://www.nlp.org.cn/docs/download.php?doc_id=294.

  8. DMOZ datasets are available at http://lshtc.iit.demokritos.gr/node/3.

  9. This version of Tan12 is available at http://www.scholat.com/vpost.html?pid=3047.

  10. In our significance tests, for each classifier on each data set, the per-class \(F_{1}\) values over all classes were used as sample data. The paired-sample \(t\)-test was used to test the null hypothesis that the pairwise differences between the \(F_{1}\) values of two classifiers have a mean of zero. A significance level of 0.05 is used throughout this paper.

  11. Hereafter, this notation denotes a classifier working on a space of the specified dimension.

  12. These synthetic corpora are available at http://www.scholat.com/vpost.html?pid=2395.

  13. The results of CenSVM using different kernels show that non-linear kernels perform better than the linear kernel on the five corpora. This suggests that documents are often not linearly separable in the low-dimensional class-centroid-based space. Our results also show that the RBF kernel outperforms the polynomial kernel. These results have been made available at http://www.scholat.com/portalPaperInfo_Eng.html?paperID=19993&Entry=pines.

  14. These synthetic corpora are available at http://www.scholat.com/vpost.html?pid=2396.

  15. These two subsets are available at http://www.scholat.com/vpost.html?pid=4038.
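As a concrete illustration of the two measures distinguished in Note 1, the sketch below computes micro- and macro-averaged \(F_{1}\) from per-class contingency counts. The counts are toy numbers, not results from the paper.

```python
# Micro-averaged F1 pools true positives / false positives / false negatives
# across classes before computing F1; macro-averaged F1 averages the per-class
# F1 values, giving every class equal weight regardless of its size.
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def micro_macro_f1(counts):
    # counts: list of (tp, fp, fn) triples, one per class
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn), macro


# A rare class with poor F1 drags macroF1 down but barely moves microF1:
micro, macro = micro_macro_f1([(90, 5, 5), (2, 8, 8)])
# micro ≈ 0.876, macro ≈ 0.574
```

This asymmetry is why both measures are reported when class distributions are imbalanced: \(microF_{1}\) is dominated by large classes, while \(macroF_{1}\) exposes weak performance on rare ones.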
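The significance test described in Note 10 can be sketched as follows: the per-class \(F_{1}\) values of two classifiers are paired, and a paired-sample \(t\)-test checks whether their mean difference is zero at the 0.05 level. The \(F_{1}\) values below are made-up illustrations, not the paper's results, and the test statistic is implemented from the textbook formula.

```python
# Paired-sample t-test: t = mean(d) / sqrt(var(d) / n), where d are the
# pairwise differences and var uses the n - 1 (sample) denominator.
import math


def paired_t(sample_a, sample_b):
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)  # t statistic with df = n - 1


# Hypothetical per-class F1 values for two classifiers on a 6-class corpus.
f1_clf_a = [0.82, 0.75, 0.91, 0.64, 0.70, 0.88]
f1_clf_b = [0.78, 0.71, 0.90, 0.55, 0.66, 0.85]

t = paired_t(f1_clf_a, f1_clf_b)  # t ≈ 3.87
# Two-sided critical value for df = 5 at alpha = 0.05 is 2.571; reject the
# null hypothesis of zero mean difference when |t| exceeds it.
significant = abs(t) > 2.571
```

In practice the same computation is available as `scipy.stats.ttest_rel`, which also returns the two-sided p-value directly.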

References

  • Achlioptas D (2003) Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J Comput Syst Sci 66(4):671–687

  • Aggarwal CC, Zhai C (2012) A survey of text classification algorithms. In: Mining text data. Springer, New York

  • Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29

  • Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517

  • Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining, pp 245–250

  • Chang C, Lin C (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27

  • Chen Y, Hung Y, Yen T, Fuh C (2007) Fast and versatile algorithm for nearest neighbor search based on a lower bound tree. Pattern Recognit 40(2):360–375

  • Cunningham P, Delany SJ (2007) k-Nearest neighbour classifiers. Technical Report UCD-CSI-2007-4, Dublin

  • Du L, Buntine W, Jin H (2010) A segmented topic model based on the two-parameter Poisson–Dirichlet process. Mach Learn 81(1):5–19

  • Du L, Buntine W, Jin H, Chen C (2012) Sequential latent Dirichlet allocation. Knowl Inf Syst 31(3):475–503

  • Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3:1289–1305

  • Guan H, Zhou J, Guo M (2009) A class-feature-centroid classifier for text categorization. In: Proceedings of the 18th international conference on World Wide Web, pp 201–210

  • Guo G, Wang H, Bell D, Bi Y, Greer K (2006) Using kNN model for automatic text categorization. Soft Comput 10(5):423–430

  • Han EH, Karypis G (2000) Centroid-based document classification: analysis and experimental results. In: Proceedings of the 4th European conference on principles of data mining and knowledge discovery, pp 116–123

  • Han E, Karypis G, Kumar V (2001) Text categorization using weight adjusted k-nearest neighbor classification. In: Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining, pp 53–65

  • Han X, Li S, Shen Z (2012) A k-NN method for large scale hierarchical text classification at LSHTC3. In: Third Pascal Challenge on Large Scale Hierarchical Text Classification

  • He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284

  • Jagadish HV, Ooi BC, Tan K, Yu C, Zhang R (2005) iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans Database Syst 30(2):364–397

  • Jiang S, Pang G, Wu M, Kuang L (2012) An improved K-nearest-neighbor algorithm for text categorization. Expert Syst Appl 39(1):1503–1509

  • Joachims T (1996) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of the 14th international conference on machine learning, pp 143–151

  • Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning, pp 137–142

  • Joachims T (2001) A statistical learning model of text classification for support vector machines. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, pp 128–136

  • Katayama N, Satoh S (1997) The SR-tree: an index structure for high-dimensional nearest neighbor queries. ACM SIGMOD Rec 26:369–380

  • Kim H, Howland P, Park H (2005) Dimension reduction in text classification with support vector machines. J Mach Learn Res 6:37–53

  • Kosmopoulos A, Gaussier E, Paliouras G, Aseervatham S (2010) The ECIR 2010 large scale hierarchical classification workshop. ACM SIGIR Forum 44(1):23–32

  • Lam W, Han Y (2003) Automatic textual document categorization based on generalized instance sets and a metamodel. IEEE Trans Pattern Anal Mach Intell 25(5):628–633

  • Lan M, Tan CL, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735

  • Lin J, Gunopulos D (2003) Dimensionality reduction by random projection and latent semantic indexing. In: Proceedings of the SDM 2003 workshop on text mining

  • Liu T, Chen Z, Zhang B, Ma W, Wu G (2004) Improving text classification using local latent semantic indexing. In: Proceedings of the 4th IEEE international conference on data mining, pp 162–169

  • Mani I, Zhang I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML 2003 workshop on learning from imbalanced datasets

  • Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

  • Miao Y, Qiu X (2009) Hierarchical centroid-based classifier for large scale text classification. In: First Pascal Challenge on Large Scale Hierarchical Text Classification

  • Moore AW, Hall T (1990) Efficient memory-based learning for robot control. Doctoral dissertation, University of Cambridge

  • Pang G, Jiang S (2013) A generalized cluster centroid based classifier for text categorization. Inf Process Manag 49(2):576–586

  • Pang G, Jiang S, Chen D (2013) A simple integration of social relationship and text data for identifying potential customers in microblogging. In: Advanced data mining and applications. Springer, Berlin

  • Papadimitriou CH, Tamaki H, Raghavan P, Vempala S (1998) Latent semantic indexing: a probabilistic analysis. In: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, pp 159–168

  • Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

  • Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47

  • Sun JT, Chen Z, Zeng HJ, Lu YC, Shi CY, Ma WY (2004) Supervised latent semantic indexing for document categorization. In: Proceedings of the 4th IEEE international conference on data mining, pp 535–538

  • Sun Y, Wong AK, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell 23(4):687–719

  • Tan S (2005) Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst Appl 28(4):667–671

  • Tan S (2006) An effective refinement strategy for KNN text classifier. Expert Syst Appl 30(2):290–298

  • Tan S, Cheng X (2007) An effective approach to enhance centroid classifier for text categorization. In: Proceedings of the 11th European conference on principles and practice of knowledge discovery in databases, pp 581–588

  • Tang L, Liu H (2005) Bias analysis in text classification for highly skewed data. In: Proceedings of the 5th IEEE international conference on data mining, pp 781–784

  • Vilalta R, Achari M, Eick CF (2003) Class decomposition via clustering: a new framework for low-variance classifiers. In: Proceedings of the 3rd IEEE international conference on data mining, pp 673–676

  • Wan CH, Lee LH, Rajkumar R, Isa D (2012) A hybrid text classification approach with low dependency on parameter by integrating K-nearest neighbor and support vector machine. Expert Syst Appl 39(15):11880–11888

  • Wang X, Zhao H, Lu B (2011) Enhance k-nearest neighbour algorithm for large-scale multi-labeled hierarchical classification. In: Second Pascal Challenge on Large Scale Hierarchical Text Classification

  • Wang X, Zhao H, Lu B (2013) A meta-top-down method for large-scale hierarchical classification. IEEE Trans Knowl Data Eng. doi:10.1109/TKDE.2013.30

  • Wettschereck D, Aha DW, Mohri T (1997) A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artif Intell Rev 11(1–5):273–314

  • Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

  • Yang Y (1994) Expert network: effective and efficient learning from human decisions in text categorization and retrieval. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval, pp 13–22

  • Yang Y, Ault T, Pierce T, Lattimer CW (2000) Improving text categorization methods for event tracking. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pp 65–72

  • Yang H, King I (2009) Sprinkled latent semantic indexing for text classification with background knowledge. Lect Notes Comput Sci 5507:53–60

  • Yang Y, Liu X (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, pp 42–49

  • Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of the 14th international conference on machine learning, pp 412–420

  • Zhang M, Zhou Z (2007) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048


Acknowledgments

We thank the anonymous reviewers, whose constructive comments helped improve the paper substantially. We also wish to thank Dr. Alexander B. Zwart for his helpful comments on refining this paper. Part of this paper’s revision was conducted while Guansong Pang was a visiting student in the Web Sciences Center at the University of Electronic Science and Technology of China; he would like to thank his supervisor, Prof. Mingsheng Shang of the Web Sciences Center, for his support of this work. This work was supported in part by the National Natural Science Foundation of China under Grants No. 61070061 and No. 61202271, and by the National Social Science Foundation of China under Grant No. 13CGL130.

Author information

Corresponding author: Guansong Pang.

Additional information

Responsible editor: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, Filip Zelezny.

This work was mainly done when Guansong Pang was with Guangdong University of Foreign Studies, China.


About this article


Cite this article

Pang, G., Jin, H. & Jiang, S. CenKNN: a scalable and effective text classifier. Data Min Knowl Disc 29, 593–625 (2015). https://doi.org/10.1007/s10618-014-0358-x

