
CenKNN: a scalable and effective text classifier

Published in Data Mining and Knowledge Discovery

Abstract

A major challenge in text classification is performing classification on a large-scale, high-dimensional text corpus in the presence of imbalanced class distributions and a large number of irrelevant or noisy term features. A number of techniques have been proposed to handle this challenge with varying degrees of success. In this paper, by combining the strengths of two widely used text classification techniques, K-Nearest-Neighbor (KNN) and centroid-based (Centroid) classifiers, we propose a scalable and effective flat classifier, called CenKNN, to cope with this challenge. CenKNN projects high-dimensional documents (often hundreds of thousands of dimensions) into a low-dimensional space (normally a few dozen dimensions) spanned by class centroids, and then uses the \(k\)-d tree structure to find the \(K\) nearest neighbors efficiently. Owing to the strong representation power of class centroids, CenKNN overcomes two issues of existing KNN text classifiers: sensitivity to imbalanced class distributions and to irrelevant or noisy term features. By working on the projected low-dimensional data, CenKNN substantially reduces the expensive computation time of KNN. CenKNN also works better than Centroid, since it uses all the class centroids to define similarity and performs well on complex data, i.e., non-linearly separable data and data with local patterns within each class. A series of experiments on English and Chinese corpora, both benchmark and synthetic, demonstrates that although CenKNN works in a significantly lower-dimensional space, it performs substantially better than KNN and its five variants, as well as existing scalable classifiers, including Centroid and Rocchio. CenKNN is also empirically preferable to another well-known classifier, support vector machines, on highly imbalanced corpora with a small number of classes.
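The project-then-search pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes TF-IDF-style row vectors, cosine similarity to each class centroid as the projection, and simple majority voting among the \(K\) neighbors; all function names are invented for illustration.

```python
# Sketch of the CenKNN idea: project documents onto similarities to class
# centroids (one dimension per class), then run KNN in that low-dimensional
# space via a k-d tree. Term weighting and tie-breaking are assumptions.
import numpy as np
from collections import Counter
from scipy.spatial import cKDTree


def l2_normalize(X):
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def fit_cenknn(X_train, y_train):
    classes = np.unique(y_train)
    # One centroid per class, normalized so dot products are cosine similarities.
    centroids = l2_normalize(
        np.vstack([X_train[y_train == c].mean(axis=0) for c in classes]))
    # Project every training document into the n_classes-dimensional space.
    Z_train = l2_normalize(X_train) @ centroids.T
    return centroids, cKDTree(Z_train), y_train


def predict_cenknn(model, X_test, k=3):
    centroids, tree, y_train = model
    Z_test = l2_normalize(X_test) @ centroids.T
    _, idx = tree.query(Z_test, k=k)          # K nearest neighbors per query
    idx = np.atleast_2d(idx)
    # Majority vote among the neighbors' labels.
    return np.array([Counter(y_train[row]).most_common(1)[0][0] for row in idx])
```

Because the projected space has only as many dimensions as there are classes, the k-d tree query remains efficient even when the original vocabulary has hundreds of thousands of terms.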

[Figures 1–12 appear in the full article.]


Notes

  1. In this paper, classification accuracy refers to performance in terms of the micro-averaged \(F_{1}\) (denoted \(microF_{1}\)) and macro-averaged \(F_{1}\) (denoted \(macroF_{1}\)) values, rather than the ratio of correctly classified test documents to the total number of test documents.

  2. By “flat” classifiers, we refer to classifiers for text classification tasks that do not consider a class hierarchy. Flat classifiers are the building blocks of successful hierarchical text classifiers.

  3. Our experiments showed that another variant we proposed in Pang and Jiang (2013) performed very similarly to INNTC (Jiang et al. 2012), so we compared CenKNN only with INNTC rather than with both.

  4. Reuters-21578 is available at http://archive.ics.uci.edu/ml/databases/reuters21578/.

  5. 20Newsgroup is available at http://qwone.com/~jason/20Newsgroups/.

  6. TanCorp is available at http://www.searchforum.org.cn/tansongbo/corpus.htm.

  7. Fudan University text classification corpus is available at http://www.nlp.org.cn/docs/download.php?doc_id=294.

  8. DMOZ datasets are available at http://lshtc.iit.demokritos.gr/node/3.

  9. This version of Tan12 is available at http://www.scholat.com/vpost.html?pid=3047.

  10. In our significance tests, for each classifier on each data set, the per-class \(F_{1}\) values over all classes were used as sample data. The paired-sample \(t\)-test was used to test the null hypothesis that the pairwise differences between the \(F_{1}\) values of two classifiers have a mean of zero. A significance level of 0.05 is used throughout this paper.

  11. Hereafter, this notation denotes a classifier working on a space of the specified dimension.

  12. These synthetic corpora are available at http://www.scholat.com/vpost.html?pid=2395.

  13. The results of CenSVM using different kernels show that non-linear kernels perform better than the linear kernel on the five corpora. This suggests that documents are often not linearly separable in the low-dimensional class-centroid-based space. Our results also show that the RBF kernel outperforms the polynomial kernel. These results have been made available at http://www.scholat.com/portalPaperInfo_Eng.html?paperID=19993&Entry=pines.

  14. These synthetic corpora are available at http://www.scholat.com/vpost.html?pid=2396.

  15. These two subsets are available at http://www.scholat.com/vpost.html?pid=4038.
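As a concrete illustration of the two measures distinguished in Note 1, the sketch below computes micro- and macro-averaged \(F_{1}\) from per-class contingency counts. The counts are toy numbers, not results from the paper.

```python
# Micro-averaged F1 pools true positives / false positives / false negatives
# across classes before computing F1; macro-averaged F1 averages the per-class
# F1 values, giving every class equal weight regardless of its size.
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def micro_macro_f1(counts):
    # counts: list of (tp, fp, fn) triples, one per class
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn), macro


# A rare class with poor F1 drags macroF1 down but barely moves microF1:
micro, macro = micro_macro_f1([(90, 5, 5), (2, 8, 8)])
# micro ≈ 0.876, macro ≈ 0.574
```

This asymmetry is why both measures are reported when class distributions are imbalanced: \(microF_{1}\) is dominated by large classes, while \(macroF_{1}\) exposes weak performance on rare ones.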
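The significance test described in Note 10 can be sketched as follows: the per-class \(F_{1}\) values of two classifiers are paired, and a paired-sample \(t\)-test checks whether their mean difference is zero at the 0.05 level. The \(F_{1}\) values below are made-up illustrations, not the paper's results, and the test statistic is implemented from the textbook formula.

```python
# Paired-sample t-test: t = mean(d) / sqrt(var(d) / n), where d are the
# pairwise differences and var uses the n - 1 (sample) denominator.
import math


def paired_t(sample_a, sample_b):
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)  # t statistic with df = n - 1


# Hypothetical per-class F1 values for two classifiers on a 6-class corpus.
f1_clf_a = [0.82, 0.75, 0.91, 0.64, 0.70, 0.88]
f1_clf_b = [0.78, 0.71, 0.90, 0.55, 0.66, 0.85]

t = paired_t(f1_clf_a, f1_clf_b)  # t ≈ 3.87
# Two-sided critical value for df = 5 at alpha = 0.05 is 2.571; reject the
# null hypothesis of zero mean difference when |t| exceeds it.
significant = abs(t) > 2.571
```

In practice the same computation is available as `scipy.stats.ttest_rel`, which also returns the two-sided p-value directly.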

References

  • Achlioptas D (2003) Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J Comput Syst Sci 66(4):671–687

  • Aggarwal CC, Zhai C (2012) A survey of text classification algorithms. In: Mining text data. Springer, New York

  • Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29

  • Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517

  • Bingham E, Mannila H (2001) Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining, pp 245–250

  • Chang C, Lin C (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27

  • Chen Y, Hung Y, Yen T, Fuh C (2007) Fast and versatile algorithm for nearest neighbor search based on a lower bound tree. Pattern Recognit 40(2):360–375

  • Cunningham P, Delany SJ (2007) k-Nearest neighbour classifiers. Technical Report UCD-CSI-2007-4, Dublin

  • Du L, Buntine W, Jin H (2010) A segmented topic model based on the two-parameter Poisson–Dirichlet process. Mach Learn 81(1):5–19

  • Du L, Buntine W, Jin H, Chen C (2012) Sequential latent Dirichlet allocation. Knowl Inf Syst 31(3):475–503

  • Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3:1289–1305

  • Guan H, Zhou J, Guo M (2009) A class-feature-centroid classifier for text categorization. In: Proceedings of the 18th international conference on World Wide Web, pp 201–210

  • Guo G, Wang H, Bell D, Bi Y, Greer K (2006) Using kNN model for automatic text categorization. Soft Comput 10(5):423–430

  • Han EH, Karypis G (2000) Centroid-based document classification: analysis and experimental results. In: Proceedings of the 4th European conference on principles of data mining and knowledge discovery, pp 116–123

  • Han E, Karypis G, Kumar V (2001) Text categorization using weight adjusted k-nearest neighbor classification. In: Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining, pp 53–65

  • Han X, Li S, Shen Z (2012) A k-NN method for large scale hierarchical text classification at LSHTC3. In: Third Pascal Challenge on Large Scale Hierarchical Text Classification

  • He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284

  • Jagadish HV, Ooi BC, Tan K, Yu C, Zhang R (2005) iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans Database Syst 30(2):364–397

  • Jiang S, Pang G, Wu M, Kuang L (2012) An improved K-nearest-neighbor algorithm for text categorization. Expert Syst Appl 39(1):1503–1509

  • Joachims T (1996) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of the 14th international conference on machine learning, pp 143–151

  • Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the 10th European conference on machine learning, pp 137–142

  • Joachims T (2001) A statistical learning model of text classification for support vector machines. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, pp 128–136

  • Katayama N, Satoh S (1997) The SR-tree: an index structure for high-dimensional nearest neighbor queries. ACM SIGMOD Rec 26:369–380

  • Kim H, Howland P, Park H (2005) Dimension reduction in text classification with support vector machines. J Mach Learn Res 6:37–53

  • Kosmopoulos A, Gaussier E, Paliouras G, Aseervatham S (2010) The ECIR 2010 large scale hierarchical classification workshop. ACM SIGIR Forum 44(1):23–32

  • Lam W, Han Y (2003) Automatic textual document categorization based on generalized instance sets and a metamodel. IEEE Trans Pattern Anal Mach Intell 25(5):628–633

  • Lan M, Tan CL, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735

  • Lin J, Gunopulos D (2003) Dimensionality reduction by random projection and latent semantic indexing. In: Proceedings of the SDM 2003 workshop on text mining

  • Liu T, Chen Z, Zhang B, Ma W, Wu G (2004) Improving text classification using local latent semantic indexing. In: Proceedings of the 4th IEEE international conference on data mining, pp 162–169

  • Mani I, Zhang I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML 2003 workshop on learning from imbalanced datasets

  • Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

  • Miao Y, Qiu X (2009) Hierarchical centroid-based classifier for large scale text classification. In: First Pascal Challenge on Large Scale Hierarchical Text Classification

  • Moore AW, Hall T (1990) Efficient memory-based learning for robot control. Doctoral dissertation, University of Cambridge

  • Pang G, Jiang S (2013) A generalized cluster centroid based classifier for text categorization. Inf Process Manag 49(2):576–586

  • Pang G, Jiang S, Chen D (2013) A simple integration of social relationship and text data for identifying potential customers in microblogging. In: Advanced data mining and applications. Springer, Berlin

  • Papadimitriou CH, Tamaki H, Raghavan P, Vempala S (1998) Latent semantic indexing: a probabilistic analysis. In: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, pp 159–168

  • Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

  • Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47

  • Sun JT, Chen Z, Zeng HJ, Lu YC, Shi CY, Ma WY (2004) Supervised latent semantic indexing for document categorization. In: Proceedings of the 4th IEEE international conference on data mining, pp 535–538

  • Sun Y, Wong AK, Kamel MS (2009) Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell 23(4):687–719

  • Tan S (2005) Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst Appl 28(4):667–671

  • Tan S (2006) An effective refinement strategy for KNN text classifier. Expert Syst Appl 30(2):290–298

  • Tan S, Cheng X (2007) An effective approach to enhance centroid classifier for text categorization. In: Proceedings of the 11th European conference on principles and practice of knowledge discovery in databases, pp 581–588

  • Tang L, Liu H (2005) Bias analysis in text classification for highly skewed data. In: Proceedings of the 5th IEEE international conference on data mining, pp 781–784

  • Vilalta R, Achari M, Eick CF (2003) Class decomposition via clustering: a new framework for low-variance classifiers. In: Proceedings of the 3rd IEEE international conference on data mining, pp 673–676

  • Wan CH, Lee LH, Rajkumar R, Isa D (2012) A hybrid text classification approach with low dependency on parameter by integrating K-nearest neighbor and support vector machine. Expert Syst Appl 39(15):11880–11888

  • Wang X, Zhao H, Lu B (2011) Enhance k-nearest neighbour algorithm for large-scale multi-labeled hierarchical classification. In: Second Pascal Challenge on Large Scale Hierarchical Text Classification

  • Wang X, Zhao H, Lu B (2013) A meta-top-down method for large-scale hierarchical classification. IEEE Trans Knowl Data Eng. doi:10.1109/TKDE.2013.30

  • Wettschereck D, Aha DW, Mohri T (1997) A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artif Intell Rev 11(1–5):273–314

  • Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

  • Yang Y (1994) Expert network: effective and efficient learning from human decisions in text categorization and retrieval. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval, pp 13–22

  • Yang Y, Ault T, Pierce T, Lattimer CW (2000) Improving text categorization methods for event tracking. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, pp 65–72

  • Yang H, King I (2009) Sprinkled latent semantic indexing for text classification with background knowledge. Lect Notes Comput Sci 5507:53–60

  • Yang Y, Liu X (1999) A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, pp 42–49

  • Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of the 14th international conference on machine learning, pp 412–420

  • Zhang M, Zhou Z (2007) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048


Acknowledgments

We thank the anonymous reviewers, whose constructive comments helped improve the paper substantially. We also wish to thank Dr. Alexander B. Zwart for his helpful comments on refining this paper. Part of this paper’s revision was conducted while Guansong Pang was a visiting student in the Web Sciences Center at the University of Electronic Science and Technology of China; he would like to thank his supervisor, Prof. Mingsheng Shang of the Web Sciences Center, for his support of this work. This work was supported in part by the National Natural Science Foundation of China under Grants No. 61070061 and No. 61202271, and by the National Social Science Foundation of China under Grant No. 13CGL130.

Author information

Corresponding author: Guansong Pang.

Additional information

Responsible editor: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, Filip Zelezny.

This work was mainly done when Guansong Pang was with Guangdong University of Foreign Studies, China.


About this article


Cite this article

Pang, G., Jin, H. & Jiang, S. CenKNN: a scalable and effective text classifier. Data Min Knowl Disc 29, 593–625 (2015). https://doi.org/10.1007/s10618-014-0358-x

