Abstract
Classification is one of the central topics in machine learning, yet most classification algorithms assume that the training dataset is balanced. While this assumption is reasonable for many problems, it often does not hold: application domains such as fraud and spam detection are characterized by highly unbalanced classes, where examples of malicious items are far less numerous than benign ones. This paper proposes a KNN-based algorithm adapted to unbalanced classes. The algorithm precomputes the distances within the training set, as well as a centrality score for every training item. It then weights the distances between the items to be classified and their K nearest training neighbors, accounting for the distribution of distances within every class and for the centrality (or outlierness) of each neighbor. This reduces the noise introduced by outliers of the majority class and increases the weight of central data points, allowing the proposed algorithm to achieve high overall accuracy in addition to a high true-positive rate (TPR) on the minority class.
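The abstract does not give the exact weighting formulas, so the following is only a minimal illustrative sketch of the idea: neighbor votes are weighted by the neighbor's precomputed centrality and by its distance normalized per class. The `centrality_scores` definition (inverse mean distance to a point's own nearest neighbors) and the per-class normalization by mean distance are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def centrality_scores(X, k=5):
    """Assumed centrality: inverse of the mean distance from each training
    point to its k nearest training neighbours. Central points score high,
    outliers score low."""
    k = min(k, len(X) - 1)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # exclude self-distance
    knn_dists = np.sort(D, axis=1)[:, :k]
    return 1.0 / (1.0 + knn_dists.mean(axis=1))

def knn_lc_predict(X_train, y_train, x, k=5):
    """Sketch of a centrality-weighted KNN vote for one query point x:
    each of the k nearest neighbours contributes its centrality divided by
    its distance to x, with distances normalised by the mean distance of
    that neighbour's class to x (a stand-in for "accounting for the
    distribution of distances in every class")."""
    cent = centrality_scores(X_train, k)
    d = np.linalg.norm(X_train - x, axis=1)
    class_scale = {c: d[y_train == c].mean() for c in np.unique(y_train)}
    votes = {}
    for i in np.argsort(d)[:k]:
        c = y_train[i]
        w = cent[i] / (1e-9 + d[i] / class_scale[c])
        votes[c] = votes.get(c, 0.0) + w
    return max(votes, key=votes.get)
```

With this weighting, an isolated majority-class outlier that happens to fall near a minority-class query contributes little, because its centrality score is small.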
Acknowledgements
This work was supported by the French Investment for the future project REQUEST (REcursive QUEry and Scalable Technologies) and the region of Champagne-Ardenne.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Jaafor, O., Birregah, B. (2020). KNN-LC: Classification in Unbalanced Datasets using a KNN-Based Algorithm and Local Centralities. In: Adjallah, K., Birregah, B., Abanda, H. (eds) Data-Driven Modeling for Sustainable Engineering. ICEASSM 2017. Lecture Notes in Networks and Systems, vol 72. Springer, Cham. https://doi.org/10.1007/978-3-030-13697-0_7
Print ISBN: 978-3-030-13696-3
Online ISBN: 978-3-030-13697-0
eBook Packages: Intelligent Technologies and Robotics (R0)