A Hybrid K-Means Algorithm Combining Preprocessing-Wise and Centroid Based-Criteria for High Dimension Datasets

Usman, Dauda; Mohamad, Ismail Bin

doi:10.1007/978-981-10-2772-7_11



Abstract

Data clustering is an unsupervised classification method that creates groups of objects, or clusters, that are distinct from one another. Among clustering techniques, K-means is the most widely used. Two issues are prominent when building a K-means clustering algorithm: the optimal number of clusters and the location of the cluster centers. In most cases the number of clusters is predetermined by the researcher, which leaves the problem of where to place the cluster centers so that scattered points are grouped properly. If the centers are not chosen well, the computational complexity increases, especially for high-dimensional datasets. To obtain an optimum solution for K-means cluster analysis, the data need to be preprocessed, either by standardizing the data or by applying principal component analysis to scaled data to reduce its dimensionality. Based on the outcomes of this preprocessing, a hybrid K-means clustering method of center initialization is developed to produce optimum-quality clusters, which makes the algorithm more efficient. The results showed that K-means with preprocessed data performed better, judging from the sum of squared errors. Further experiments with the hybrid K-means algorithm were conducted on simulated datasets, and it was observed that the sum of the total clustering errors was reduced significantly, while the inter-cluster distances were kept as large as possible for better cluster identification.
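
The preprocessing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' hybrid method, whose initialization rule is not given in this preview. It only shows z-score standardization followed by plain Lloyd-style K-means, with the sum of squared errors (SSE) as the quality measure. The function names (`standardize`, `kmeans`), the toy data, and the choice of initial centers are all assumptions made for illustration; the PCA step is omitted.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column: (x - column mean) / column standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant column: avoid dividing by 0
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(rows, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(r, centers[j]))
            for r in rows]

def kmeans(rows, init_centers, max_iter=100):
    """Plain K-means from the given initial centers (hypothetical helper).

    Returns (labels, centers, sse), where sse is the sum of squared
    distances from each point to its assigned center.
    """
    centers = [list(c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(rows, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [r for r, l in zip(rows, labels) if l == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append([mean(c) for c in zip(*members)]
                               if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(rows, centers)  # final assignment against final centers
    sse = sum(sq_dist(r, centers[l]) for r, l in zip(rows, labels))
    return labels, centers, sse

# Toy data (made up): column 0 separates two groups on a small scale,
# column 1 is large-scale noise that dominates raw Euclidean distances.
rows = [[0.10, 100.0], [0.20, 900.0], [0.15, 500.0],
        [0.90, 120.0], [1.00, 880.0], [0.95, 510.0]]

scaled = standardize(rows)
# Initial centers: one arbitrary point from each intended group (an assumption).
raw_labels, _, raw_sse = kmeans(rows, [rows[0], rows[3]])
std_labels, _, std_sse = kmeans(scaled, [scaled[0], scaled[3]])
print("raw labels:         ", raw_labels)
print("standardized labels:", std_labels, "SSE:", round(std_sse, 3))
```

In this toy setting the raw run groups points by the large-scale noise column, while the standardized run groups them by the informative column. Note that SSE values computed on differently scaled versions of the data are on different scales, so the sketch reports the SSE only for the standardized run.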



Author information

Authors and Affiliations

Department of Mathematics and Computer Science, Faculty of Natural and Applied Sciences, Umaru Musa Yar’adua University, Katsina, Nigeria
Dauda Usman
Department of Mathematical Sciences, Faculty of Science, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
Ismail Bin Mohamad


Corresponding author

Correspondence to Dauda Usman.

Editor information

Editors and Affiliations

Universiti Teknologi MARA, Kedah, Merbok, Malaysia
Abd-Razak Ahmad
Universiti Teknologi MARA, Kedah, Merbok, Malaysia
Liew Kee Kor
Universiti Teknologi MARA, Kedah, Merbok, Malaysia
Illiasaak Ahmad
Universiti Teknologi MARA, Kedah, Merbok, Malaysia
Zanariah Idrus


About this paper

Cite this paper

Usman, D., Mohamad, I.B. (2017). A Hybrid K-Means Algorithm Combining Preprocessing-Wise and Centroid Based-Criteria for High Dimension Datasets. In: Ahmad, AR., Kor, L., Ahmad, I., Idrus, Z. (eds) Proceedings of the International Conference on Computing, Mathematics and Statistics (iCMS 2015). Springer, Singapore. https://doi.org/10.1007/978-981-10-2772-7_11


DOI: https://doi.org/10.1007/978-981-10-2772-7_11
Published: 24 November 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2770-3
Online ISBN: 978-981-10-2772-7
eBook Packages: Education (R0)
