
KNN-LC: Classification in Unbalanced Datasets using a KNN-Based Algorithm and Local Centralities

  • Conference paper
  • First Online:
Data-Driven Modeling for Sustainable Engineering (ICEASSM 2017)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 72)

Abstract

Classification is one of the most central topics in machine learning. Yet most classification algorithms operate under the assumption that the training data are balanced. While this assumption is reasonable for many problems, it often does not hold: application domains such as fraud and spam detection are characterized by highly unbalanced classes, where examples of malicious items are far less numerous than benign ones. This paper proposes a KNN-based algorithm adapted to unbalanced classes. The algorithm precomputes the distances within the training set as well as a centrality score for every training item. It then weights the distances between the items to be classified and their K nearest training neighbors, accounting for the distribution of distances in each class and for the centrality (or outlierness) of the neighbors. This reduces the noise introduced by outliers of the majority class and strengthens the influence of central data points, allowing the proposed algorithm to achieve high accuracy as well as a high true positive rate (TPR) on the minority class.
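The weighting scheme described in the abstract can be illustrated with a short sketch. The Python code below is a minimal, hypothetical implementation based only on the description above: it assumes a degree-style local centrality (the inverse of the mean distance to the k nearest training neighbours) and a per-class normalisation by the mean within-class distance. The exact centrality measure and weighting formulas of KNN-LC are defined in the paper and may differ; names such as local_centrality and scale are illustrative, not taken from the paper.

import numpy as np
from collections import Counter

def pairwise_distances(X):
    # Euclidean distance matrix over the training set (n x n).
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def local_centrality(D, k):
    # Hypothetical local-centrality score: inverse of the mean distance
    # to the k nearest training neighbours (large = central, small = outlier-like).
    nearest = np.sort(D, axis=1)[:, 1:k + 1]   # column 0 is the self-distance (0)
    return 1.0 / (nearest.mean(axis=1) + 1e-12)

def fit(X_train, y_train, k=5):
    D = pairwise_distances(X_train)
    centrality = local_centrality(D, k)
    # Mean within-class distance, used to normalise distances so that the
    # sparser minority class is not penalised by its larger spread.
    scale = {c: D[np.ix_(y_train == c, y_train == c)].mean()
             for c in np.unique(y_train)}
    return {"X": X_train, "y": y_train, "centrality": centrality,
            "scale": scale, "k": k}

def predict_one(model, x):
    X, y, k = model["X"], model["y"], model["k"]
    d = np.sqrt(((X - x) ** 2).sum(axis=1))
    neighbours = np.argsort(d)[:k]             # indices of the k nearest training items
    votes = Counter()
    for i in neighbours:
        c = y[i]
        # Weight each neighbour by its centrality and its class-normalised distance,
        # so central points of the minority class are not drowned out by
        # outliers of the majority class.
        votes[c] += model["centrality"][i] / (d[i] / model["scale"][c] + 1e-12)
    return votes.most_common(1)[0][0]

# Usage sketch: model = fit(X_train, y_train, k=5); y_hat = predict_one(model, x_new)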



Acknowledgements

This work was supported by the French Investment for the Future project REQUEST (REcursive QUEry and Scalable Technologies) and by the Champagne-Ardenne region.

Author information

Correspondence to Omar Jaafor.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Jaafor, O., Birregah, B. (2020). KNN-LC: Classification in Unbalanced Datasets using a KNN-Based Algorithm and Local Centralities. In: Adjallah, K., Birregah, B., Abanda, H. (eds) Data-Driven Modeling for Sustainable Engineering. ICEASSM 2017. Lecture Notes in Networks and Systems, vol 72. Springer, Cham. https://doi.org/10.1007/978-3-030-13697-0_7
