Abstract
Data mining is the process of discovering knowledge from large data sources. Classification and clustering are its two broad branches of study. In clustering, k-means is one of the benchmark algorithms and is used in numerous applications. Its popularity is due to its efficiency and low memory usage. One shortcoming of k-means is that its performance degrades when it is applied to imbalanced data distributions. The clusters it produces are relatively uniform in size even when the input data contain clusters of non-uniform size, a behavior known in the literature as the “uniform effect.” This paper proposes several novel algorithms to address this problem and compares them with each other. Experiments on eleven UCI datasets, assessed with evaluation metrics, show that the proposed algorithms are effective in mitigating the “uniform effect.”
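The “uniform effect” can be reproduced on a small synthetic example. The sketch below is only an illustration under stated assumptions and is not one of the algorithms proposed in the paper: it runs plain Lloyd's k-means on hypothetical one-dimensional data with a 900-point majority class and a 100-point minority class, and compares the dispersion of the true class sizes with that of the resulting cluster sizes. The helper names (`kmeans_1d`, `cv`), the data layout, and the choice of the coefficient of variation as the dispersion measure are all assumptions made for this illustration.

```python
import statistics


def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd's k-means on 1-D data; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


def cv(sizes):
    """Coefficient of variation (sample std / mean) of a list of sizes."""
    return statistics.stdev(sizes) / statistics.mean(sizes)


# Hypothetical imbalanced data: 900 majority points spread evenly over
# [0, 10] and 100 minority points spread evenly over [11, 12].
majority = [10 * i / 899 for i in range(900)]
minority = [11 + i / 99 for i in range(100)]
points = majority + minority
true_sizes = [len(majority), len(minority)]

# Start from the true class means, the most favorable initialization.
initial = [statistics.mean(majority), statistics.mean(minority)]
labels, centroids = kmeans_1d(points, initial)
cluster_sizes = [labels.count(j) for j in range(2)]

print("true sizes:   ", true_sizes, "CV = %.3f" % cv(true_sizes))
print("k-means sizes:", cluster_sizes, "CV = %.3f" % cv(cluster_sizes))
```

Even though the centroids start at the true class means, the cluster boundary drifts well into the majority class, so the two clusters end up much closer in size than the 900/100 split of the underlying classes. Real datasets are noisier than this construction, so the size of the effect there will differ.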
© 2016 Springer India
About this paper
Cite this paper
Santhosh Kumar, C.N., Rao, K.N., Govardhan, A. (2016). An Empirical Comparative Study of Novel Clustering Algorithms for Class Imbalance Learning. In: Satapathy, S., Raju, K., Mandal, J., Bhateja, V. (eds) Proceedings of the Second International Conference on Computer and Communication Technologies. Advances in Intelligent Systems and Computing, vol 380. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2523-2_17
DOI: https://doi.org/10.1007/978-81-322-2523-2_17
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-2522-5
Online ISBN: 978-81-322-2523-2
eBook Packages: Engineering (R0)