Abstract
Classification is one of the central topics in machine learning, yet most classification algorithms assume that the training dataset is balanced. While this assumption is reasonable for many problems, it often does not hold: application domains such as fraud and spam detection are characterized by highly unbalanced classes, where examples of malicious items are far less numerous than benign ones. This paper proposes a KNN-based algorithm adapted to unbalanced classes. The algorithm precomputes the distances within the training set, as well as a centrality score for every training item. It then weights the distances between the items to be classified and their K nearest training neighbors, accounting for the distribution of distances within every class and for the centrality (or outlierness) of each neighbor. This reduces the noise introduced by outliers of the majority class and increases the weight of central data points, allowing the proposed algorithm to achieve high overall accuracy in addition to a high true-positive rate (TPR) on the minority class.
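The abstract does not give the exact weighting formulas, so the following is only a minimal illustrative sketch of the idea: neighbor votes are weighted by the neighbor's precomputed centrality and by its distance normalized per class. The `centrality_scores` definition (inverse mean distance to a point's own nearest neighbors) and the per-class normalization by mean distance are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def centrality_scores(X, k=5):
    """Assumed centrality: inverse of the mean distance from each training
    point to its k nearest training neighbours. Central points score high,
    outliers score low."""
    k = min(k, len(X) - 1)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # exclude self-distance
    knn_dists = np.sort(D, axis=1)[:, :k]
    return 1.0 / (1.0 + knn_dists.mean(axis=1))

def knn_lc_predict(X_train, y_train, x, k=5):
    """Sketch of a centrality-weighted KNN vote for one query point x:
    each of the k nearest neighbours contributes its centrality divided by
    its distance to x, with distances normalised by the mean distance of
    that neighbour's class to x (a stand-in for "accounting for the
    distribution of distances in every class")."""
    cent = centrality_scores(X_train, k)
    d = np.linalg.norm(X_train - x, axis=1)
    class_scale = {c: d[y_train == c].mean() for c in np.unique(y_train)}
    votes = {}
    for i in np.argsort(d)[:k]:
        c = y_train[i]
        w = cent[i] / (1e-9 + d[i] / class_scale[c])
        votes[c] = votes.get(c, 0.0) + w
    return max(votes, key=votes.get)
```

With this weighting, an isolated majority-class outlier that happens to fall near a minority-class query contributes little, because its centrality score is small.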
Acknowledgements
This work was supported by the French Investment for the future project REQUEST (REcursive QUEry and Scalable Technologies) and the region of Champagne-Ardenne.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Jaafor, O., Birregah, B. (2020). KNN-LC: Classification in Unbalanced Datasets using a KNN-Based Algorithm and Local Centralities. In: Adjallah, K., Birregah, B., Abanda, H. (eds) Data-Driven Modeling for Sustainable Engineering. ICEASSM 2017. Lecture Notes in Networks and Systems, vol 72. Springer, Cham. https://doi.org/10.1007/978-3-030-13697-0_7
Print ISBN: 978-3-030-13696-3
Online ISBN: 978-3-030-13697-0
eBook Packages: Intelligent Technologies and Robotics (R0)