Improving css-KNN Classification Performance by Shifts in Training Data
This paper presents a new approach to improving the performance of the css-k-NN classifier for the categorization of text documents. The css-k-NN classifier (a threshold-based variation of the standard k-NN classifier that we proposed in earlier work) is a lazy-learning, instance-based classifier. It has no parameters associated with features or classes of objects that could be optimized during off-line learning. In this paper we propose a training data preprocessing phase that tries to alleviate this lack of learning. The idea is to compute shifts of the training data such that class-representative instances are optimized before the actual k-NN algorithm is employed. Empirical text classification experiments on mid-size Wikipedia data sets show that carefully cross-validated settings of this preprocessing yield significant improvements in k-NN performance compared to classification without it. The proposed approach can be useful for improving the effectiveness of other classifiers, and it can also find applications in the domain of recommender systems and keyword-based search.
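To make the idea concrete, the preprocessing step can be illustrated as a minimal sketch: each training vector is shifted toward its class centroid before a plain k-NN vote is taken. Note that this is only an illustrative assumption — the functions `shift_training_data` and `knn_predict`, the shift factor `alpha`, and the choice of centroid-directed shifts are hypothetical stand-ins, not the cross-validated shift computation described in the paper.

```python
import numpy as np

def shift_training_data(X, y, alpha=0.3):
    """Shift each training vector toward its class centroid by factor alpha.

    alpha = 0 leaves the data unchanged; alpha = 1 collapses each class
    onto its centroid. The paper's actual shift settings are found by
    cross-validation; a fixed alpha here is purely illustrative.
    """
    X_shifted = X.astype(float).copy()
    for c in np.unique(y):
        mask = (y == c)
        centroid = X[mask].mean(axis=0)
        X_shifted[mask] = (1.0 - alpha) * X[mask] + alpha * centroid
    return X_shifted

def knn_predict(X_train, y_train, x, k=3):
    """Plain k-NN: majority vote among the k nearest training vectors."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(distances)[:k]]
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]
```

Usage would follow the order described above: first shift the training set, then run the unmodified k-NN algorithm on the shifted instances.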
Keywords: k-NN classifier, Wikipedia, document classification, missing data imputation
The authors would like to acknowledge networking support by the ICT COST Action IC1302 KEYSTONE - Semantic keyword-based search on structured data sources (www.keystone-cost.eu).
- 1. Draszawka, K., Szymanski, J.: Thresholding strategies for large scale multi-label text classifier. In: The 6th International Conference on Human System Interaction (HSI), pp. 350–355. IEEE (2013)
- 2. Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML, vol. 99, pp. 200–209 (1999)
- 3. McCallum, A., Nigam, K., et al.: A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, vol. 752, pp. 41–48. Citeseer (1998)
- 7. Wang, X., Zhao, H., Lu, B.: Enhanced k-nearest neighbour algorithm for large-scale hierarchical multi-label classification. In: Proceedings of the Joint ECML/PKDD PASCAL Workshop on Large-Scale Hierarchical Classification, Athens, Greece, vol. 5 (2011)
- 8. Zhou, Y., Li, Y., Xia, S.: An improved kNN text classification algorithm based on clustering. J. Comput. 4, 230–237 (2009)
- 10. Read, J.: Scalable multi-label classification. Ph.D. thesis, University of Waikato (2010)
- 12. Tang, L., Rajan, S., Narayanan, V.K.: Large scale multi-label classification via metalabeler. In: Proceedings of the 18th International Conference on World Wide Web, pp. 211–220. ACM (2009)
- 17. Juan, A., Ney, H.: Reversing and smoothing the multinomial Naive Bayes text classifier. In: PRIS, pp. 200–212. Citeseer (2002)
- 19. Soucy, P., Mineau, G.W.: Beyond TF-IDF weighting for text categorization in the vector space model. In: IJCAI, vol. 5, pp. 1130–1135 (2005)
- 22. Bergamaschi, S., Domnori, E., Guerra, F., Trillo-Lado, R., Velegrakis, Y.: Keyword search over relational databases: a metadata approach. In: SIGMOD, pp. 565–576. ACM (2011)
Open Access This chapter is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.