An Empirical Comparative Study of Novel Clustering Algorithms for Class Imbalance Learning

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 380)

Abstract

Data mining is the process of discovering knowledge from vast data sources. In data mining, classification and clustering are the two broad branches of study. In clustering, the k-means algorithm is one of the benchmark algorithms and is used in numerous applications; its popularity is due to its efficiency and low memory usage. One shortcoming of k-means is that its performance degrades when it is applied to data with an imbalanced distribution. The clusters produced by k-means tend to be relatively uniform in size even when the input data contain clusters of non-uniform sizes, a behavior referred to in the literature as the “uniform effect.” This paper proposes several novel algorithms to address this problem and compares them with each other. Experiments with the proposed algorithms on eleven UCI datasets, assessed with evaluation metrics, show that the proposed algorithms are effective in addressing the “uniform effect.”
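The “uniform effect” described in the abstract can be illustrated with a minimal sketch. This is not one of the paper's proposed algorithms: it is plain Lloyd's k-means run on hypothetical synthetic data, and the data parameters, the helper names (`kmeans`, `size_cv`), and the use of the coefficient of variation of cluster sizes as a dispersion measure are all assumptions made for illustration.

```python
import random
import statistics


def kmeans(points, init_centroids, max_iter=100):
    """Plain Lloyd's k-means on 2-D points; returns one cluster label per point."""
    centroids = list(init_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels


def size_cv(sizes):
    """Coefficient of variation of cluster sizes (population std / mean)."""
    return statistics.pstdev(sizes) / statistics.mean(sizes)


random.seed(0)
# Hypothetical data: a broad majority class (950 points) and a compact
# minority class (50 points) whose centers are 4 units apart.
majority = [(random.gauss(0, 2), random.gauss(0, 2)) for _ in range(950)]
minority = [(random.gauss(4, 0.5), random.gauss(0, 0.5)) for _ in range(50)]
points = majority + minority

# Start from the true class centers, the most favorable initialization.
labels = kmeans(points, [(0.0, 0.0), (4.0, 0.0)])

true_sizes = [len(majority), len(minority)]
found_sizes = [labels.count(0), labels.count(1)]
print("true sizes:", true_sizes, "CV =", round(size_cv(true_sizes), 3))
print("k-means sizes:", found_sizes, "CV =", round(size_cv(found_sizes), 3))
```

Even when started from the true class centers, k-means assigns part of the broad majority class to the minority cluster, so the sizes it reports are more even than the true 950/50 split. A lower coefficient of variation for the k-means sizes than for the true sizes is one way to see the effect.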

Author information

Corresponding author

Correspondence to Ch. N. Santhosh Kumar.

Copyright information

© 2016 Springer India

About this paper

Cite this paper

Santhosh Kumar, C.N., Rao, K.N., Govardhan, A. (2016). An Empirical Comparative Study of Novel Clustering Algorithms for Class Imbalance Learning. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 380. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2523-2_17

  • DOI: https://doi.org/10.1007/978-81-322-2523-2_17

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-322-2522-5

  • Online ISBN: 978-81-322-2523-2

  • eBook Packages: Engineering, Engineering (R0)
