Improving css-KNN Classification Performance by Shifts in Training Data

Draszawka, Karol; Szymański, Julian; Guerra, Francesco

doi:10.1007/978-3-319-27932-9_5

Improving css-KNN Classification Performance by Shifts in Training Data

Karol Draszawka¹⁹,
Julian Szymański¹⁹ &
Francesco Guerra²⁰

Conference paper
First Online: 07 January 2016

473 Accesses
4 Citations

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9398))

Abstract

This paper presents a new approach to improve the performance of a css-k-NN classifier for categorization of text documents. The css-k-NN classifier (i.e., a threshold-based variation of a standard k-NN classifier we proposed in [1]) is a lazy-learning instance-based classifier. It does not have parameters associated with features and/or classes of objects, that would be optimized during off-line learning. In this paper we propose a training data preprocessing phase that tries to alleviate the lack of learning. The idea is to compute training data modifications, such that class representative instances are optimized before the actual k-NN algorithm is employed. The empirical text classification experiments using mid-size Wikipedia data sets show that carefully cross-validated settings of such preprocessing yields significant improvements in k-NN performance compared to classification without this step. The proposed approach can be useful for improving the effectivenes of other classifiers as well as it can find applications in domain of recommendation systems and keyword-based search.

This is a preview of subscription content, log in via an institution.

References

Draszawka, K., Szymanski, J.: Thresholding strategies for large scale multi-label text classifier. In: The 6th International Conference on Human System Interaction (HSI), 2013, pp. 350–355. IEEE (2013)
Google Scholar
Joachims, T.: Transductive inference for text classification using support vector machines. In: ICML, vol. 99, pp. 200–209 (1999)
Google Scholar
McCallum, A., Nigam, K., et al.: A comparison of event models for naive bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, vol. 752, pp. 41–48. Citeseer (1998)
Google Scholar
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 34, 1–47 (2002)
Article Google Scholar
Westa, M., Szymański, J., Krawczyk, H.: Text classifiers for automatic articles categorization. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2012, Part II. LNCS, vol. 7268, pp. 196–204. Springer, Heidelberg (2012)
Chapter Google Scholar
Tan, S.: Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst. Appl. 28, 667–671 (2005)
Article Google Scholar
Wang, X., Zhao, H., Lu, B.: Enhanced k-nearest neighbour algorithm for largescale hierarchical multi-label classification. In: Proceedings of the Joint ECML/PKDD PASCAL Workshop on Large-Scale Hierarchical Classification, Athens, Greece, vol. 5 (2011)
Google Scholar
Zhou, Y., Li, Y., Xia, S.: An improved knn text classification algorithm based on clustering. J. Comput. 4, 230–237 (2009)
Google Scholar
Zhang, M.L., Zhou, Z.H.: Ml-knn: a lazy learning approach to multi-label learning. Pattern Recogn. 40, 2038–2048 (2007)
Article Google Scholar
Read, J.: Scalable multi-label classification. Ph.D. thesis, University of Waikato (2010)
Google Scholar
Yu, H., Yang, J., Han, J., Li, X.: Making svms scalable to large data sets using hierarchical cluster indexing. Data Min. Knowl. Disc. 11, 295–321 (2005)
Article Google Scholar
Tang, L., Rajan, S., Narayanan, V.K.: Large scale multi-label classification via metalabeler. In: Proceedings of the 18th International Conference on World Wide Web, pp. 211–220. ACM (2009)
Google Scholar
Kaiser, J.: Dealing with missing values in data. J. Syst. Integr. 5, 42–51 (2014)
Article Google Scholar
Grzymała-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W.P., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, p. 378. Springer, Heidelberg (2001)
Chapter Google Scholar
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, 2nd edn. Wiley, New York (2002)
Book Google Scholar
Farhangfar, A., Kurgan, L.A., Dy, J.G.: Impact of imputation of missing values on classification error for discrete data. Pattern Recogn. 41, 3692–3705 (2008)
Article Google Scholar
Juan, A., Ney, H.: Reversing and smoothing the multinomial naive bayes text classifier. In: PRIS, pp. 200–212. Citeseer (2002)
Google Scholar
Szymanski, J.: Comparative analysis of text representation methods using classification. Cybern. Syst. 45, 180–199 (2014)
Article Google Scholar
Soucy, P., Mineau, G.W.: Beyond tfidf weighting for text categorization in the vector space model. IJCAI. 5, 1130–1135 (2005)
Google Scholar
Tsoumakas, G., Vlahavas, I.P.: Random k-Labelsets: an ensemble method for multilabel classification. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 406–417. Springer, Heidelberg (2007)
Chapter Google Scholar
Kiritchenko, S., Matwin, S., Nock, R., Famili, A.F.: Learning and evaluation in the presence of class hierarchies: application to text categorization. In: Lamontagne, L., Marchand, M. (eds.) Canadian AI 2006. LNCS (LNAI), vol. 4013, pp. 395–406. Springer, Heidelberg (2006)
Chapter Google Scholar
Bergamaschi, S., Domnori, E., Guerra, F., Trillo-Lado, R., Velegrakis, Y.: Keyword search over relational databases: a metadata approach. In: SIGMOD, pp. 565–576. ACM (2011)
Google Scholar

Download references

Acknowledgement

The authors would like to acknowledge networking support by the ICT COST Action IC1302 KEYSTONE - Semantic keyword-based search on structured data sources (www.keystone-cost.eu).

Author information

Authors and Affiliations

Gdańsk University of Technology, Gdańsk, Poland
Karol Draszawka & Julian Szymański
Universita’ di Modena e Reggio Emilia, Modena, Italy
Francesco Guerra

Authors

Karol Draszawka
View author publications
You can also search for this author in PubMed Google Scholar
Julian Szymański
View author publications
You can also search for this author in PubMed Google Scholar
Francesco Guerra
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Karol Draszawka .

Editor information

Editors and Affiliations

University of Coimbra, Coimbra, Portugal
Jorge Cardoso
Huawei European Research Center, Munich, Germany
Jorge Cardoso
Dipartimento di Ingegneria “Enzo Ferrari”, Università di Modena e Reggio Emilia, Modena, Italy
Francesco Guerra
Delft University of Technology, Delft, Zuid-Holland, The Netherlands
Geert-Jan Houben
University of Coimbra, Coimbra, Portugal
Alexandre Miguel Pinto
Università degli Studi di Trento, Trento, Italy
Yannis Velegrakis

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

About this paper

Cite this paper

Draszawka, K., Szymański, J., Guerra, F. (2015). Improving css-KNN Classification Performance by Shifts in Training Data . In: Cardoso, J., Guerra, F., Houben, GJ., Pinto, A.M., Velegrakis, Y. (eds) Semantic Keyword-Based Search on Structured Data Sources. IKC 2015. Lecture Notes in Computer Science(), vol 9398. Springer, Cham. https://doi.org/10.1007/978-3-319-27932-9_5

Download citation

DOI: https://doi.org/10.1007/978-3-319-27932-9_5
Published: 07 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27931-2
Online ISBN: 978-3-319-27932-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics