Abstract
This paper presents a new approach to improving the performance of a css-k-NN classifier for the categorization of text documents. The css-k-NN classifier (i.e., a threshold-based variation of the standard k-NN classifier that we proposed in [1]) is a lazy-learning, instance-based classifier. It has no parameters associated with features or classes of objects that could be optimized during off-line learning. In this paper we propose a training data preprocessing phase that tries to alleviate this lack of learning. The idea is to compute modifications of the training data such that class representative instances are optimized before the actual k-NN algorithm is employed. Empirical text classification experiments on mid-size Wikipedia data sets show that carefully cross-validated settings of this preprocessing yield significant improvements in k-NN performance compared to classification without this step. The proposed approach can be useful for improving the effectiveness of other classifiers, and it may also find applications in the domain of recommendation systems and keyword-based search.
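The general shape of such a pipeline (a preprocessing shift of the training data followed by a threshold-based k-NN decision) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the method of the paper: the abstract does not specify the shift rule or the thresholding strategy, so the centroid-directed shift, the `alpha` parameter, the relative score threshold, and the function names `shift_toward_centroids` and `threshold_knn` are all hypothetical stand-ins.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def shift_toward_centroids(X, y, alpha):
    """Move every training vector a fraction `alpha` toward its class centroid.

    ASSUMPTION: a hypothetical shift rule chosen for illustration only.
    alpha = 0 leaves the data unchanged; alpha = 1 collapses each class
    onto its centroid. In practice alpha would be cross-validated.
    """
    groups = defaultdict(list)
    for vec, label in zip(X, y):
        groups[label].append(vec)
    centroids = {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }
    return [
        [(1 - alpha) * v + alpha * c for v, c in zip(vec, centroids[label])]
        for vec, label in zip(X, y)
    ]


def threshold_knn(X, y, query, k, threshold):
    """Return all labels whose similarity-weighted vote among the k nearest
    neighbors is at least `threshold` times the best label's vote.

    ASSUMPTION: a generic relative-score threshold, used here as a stand-in
    for the thresholding strategy of the css-k-NN classifier.
    """
    neighbors = sorted(
        ((cosine(vec, query), label) for vec, label in zip(X, y)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, label in neighbors:
        scores[label] += sim
    best = max(scores.values(), default=0.0)
    if best <= 0:
        return []
    return sorted(label for label, s in scores.items() if s / best >= threshold)


# Toy data: two classes in a two-dimensional term space.
X_train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y_train = ["a", "a", "b", "b"]

X_shifted = shift_toward_centroids(X_train, y_train, alpha=0.5)
predicted = threshold_knn(X_shifted, y_train, [1.0, 0.0], k=2, threshold=0.5)
```

Keeping the shift as a separate step means the k-NN stage itself stays lazy: all tuning happens in the preprocessing parameters, which can be selected by cross-validation on the training set.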
References
Draszawka, K., Szymanski, J.: Thresholding strategies for large scale multi-label text classifier. In: The 6th International Conference on Human System Interaction (HSI), 2013, pp. 350–355. IEEE (2013)
Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML, vol. 99, pp. 200–209 (1999)
McCallum, A., Nigam, K., et al.: A comparison of event models for naive bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, vol. 752, pp. 41–48. Citeseer (1998)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34, 1–47 (2002)
Westa, M., Szymański, J., Krawczyk, H.: Text classifiers for automatic articles categorization. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2012, Part II. LNCS, vol. 7268, pp. 196–204. Springer, Heidelberg (2012)
Tan, S.: Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst. Appl. 28, 667–671 (2005)
Wang, X., Zhao, H., Lu, B.: Enhanced k-nearest neighbour algorithm for large-scale hierarchical multi-label classification. In: Proceedings of the Joint ECML/PKDD PASCAL Workshop on Large-Scale Hierarchical Classification, Athens, Greece, vol. 5 (2011)
Zhou, Y., Li, Y., Xia, S.: An improved knn text classification algorithm based on clustering. J. Comput. 4, 230–237 (2009)
Zhang, M.L., Zhou, Z.H.: Ml-knn: a lazy learning approach to multi-label learning. Pattern Recogn. 40, 2038–2048 (2007)
Read, J.: Scalable multi-label classification. Ph.D. thesis, University of Waikato (2010)
Yu, H., Yang, J., Han, J., Li, X.: Making svms scalable to large data sets using hierarchical cluster indexing. Data Min. Knowl. Disc. 11, 295–321 (2005)
Tang, L., Rajan, S., Narayanan, V.K.: Large scale multi-label classification via metalabeler. In: Proceedings of the 18th International Conference on World Wide Web, pp. 211–220. ACM (2009)
Kaiser, J.: Dealing with missing values in data. J. Syst. Integr. 5, 42–51 (2014)
Grzymała-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, p. 378. Springer, Heidelberg (2001)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, 2nd edn. Wiley, New York (2002)
Farhangfar, A., Kurgan, L.A., Dy, J.G.: Impact of imputation of missing values on classification error for discrete data. Pattern Recogn. 41, 3692–3705 (2008)
Juan, A., Ney, H.: Reversing and smoothing the multinomial naive bayes text classifier. In: PRIS, pp. 200–212. Citeseer (2002)
Szymanski, J.: Comparative analysis of text representation methods using classification. Cybern. Syst. 45, 180–199 (2014)
Soucy, P., Mineau, G.W.: Beyond tfidf weighting for text categorization in the vector space model. In: IJCAI, vol. 5, pp. 1130–1135 (2005)
Tsoumakas, G., Vlahavas, I.P.: Random k-Labelsets: an ensemble method for multilabel classification. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 406–417. Springer, Heidelberg (2007)
Kiritchenko, S., Matwin, S., Nock, R., Famili, A.F.: Learning and evaluation in the presence of class hierarchies: application to text categorization. In: Lamontagne, L., Marchand, M. (eds.) Canadian AI 2006. LNCS (LNAI), vol. 4013, pp. 395–406. Springer, Heidelberg (2006)
Bergamaschi, S., Domnori, E., Guerra, F., Trillo-Lado, R., Velegrakis, Y.: Keyword search over relational databases: a metadata approach. In: SIGMOD, pp. 565–576. ACM (2011)
Acknowledgement
The authors would like to acknowledge networking support by the ICT COST Action IC1302 KEYSTONE - Semantic keyword-based search on structured data sources (www.keystone-cost.eu).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Draszawka, K., Szymański, J., Guerra, F. (2015). Improving css-KNN Classification Performance by Shifts in Training Data. In: Cardoso, J., Guerra, F., Houben, GJ., Pinto, A.M., Velegrakis, Y. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2015. Lecture Notes in Computer Science, vol 9398. Springer, Cham. https://doi.org/10.1007/978-3-319-27932-9_5
DOI: https://doi.org/10.1007/978-3-319-27932-9_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27931-2
Online ISBN: 978-3-319-27932-9
eBook Packages: Computer Science, Computer Science (R0)