A Hybrid K-Means Algorithm Combining Preprocessing-Wise and Centroid Based-Criteria for High Dimension Datasets

  • Conference paper
Proceedings of the International Conference on Computing, Mathematics and Statistics (iCMS 2015)

Abstract

Data clustering is an unsupervised classification method that partitions objects into distinct groups, or clusters. Among clustering techniques, K-means is the most widely used. Two issues are prominent in designing a K-means clustering algorithm: the optimal number of clusters and the location of the cluster centers. In most cases the number of clusters is predetermined by the researcher, which leaves the challenge of where to place the cluster centers so that scattered points are grouped properly. If the centers are not chosen well, the computational complexity increases, especially for high-dimensional data sets. To obtain an optimum solution for K-means cluster analysis, the data need to be preprocessed, either by standardizing the data or by applying principal component analysis to the scaled data to reduce its dimensionality. Based on the outcome of this preprocessing, a hybrid K-means clustering method of center initialization is developed to produce optimum-quality clusters, which makes the algorithm more efficient. The results showed that K-means with preprocessed data performed better, judging from the sum of squared errors. A further experiment with the hybrid K-means algorithm was conducted on simulated datasets, and it was observed that the sum of the total clustering errors was reduced significantly, whereas the inter-cluster distances were kept as large as possible for better cluster identification.
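To make the preprocessing step concrete, the following is a minimal sketch, not the authors' implementation, of z-score standardization followed by plain K-means (Lloyd iterations) with the sum of squared errors (SSE) as the quality measure. The toy data, the function names, and the choice of rows 0 and 3 as starting centers are all assumptions made for illustration. The hybrid center initialization developed in the paper is not described in the abstract and is not reproduced here, and principal component analysis is omitted for brevity.

```python
import math

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column std."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has std 0; fall back to 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, centers, max_iter=100):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(len(centers)), key=lambda j: sq_dist(r, centers[j]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    return labels, centers, sse

# Hypothetical toy data: column 0 separates two groups (about 1 vs about 5),
# column 1 is a large-scale feature unrelated to the grouping.
data = [[1.0, 1000.0], [1.2, 3000.0], [0.9, 2000.0],
        [5.0, 1100.0], [5.2, 2900.0], [4.9, 2100.0]]

# Arbitrary starting centers (rows 0 and 3), used for both runs.
raw_labels, _, raw_sse = kmeans(data, [data[0], data[3]])
scaled = standardize(data)
std_labels, _, std_sse = kmeans(scaled, [scaled[0], scaled[3]])

print(raw_labels)  # [0, 1, 1, 0, 1, 1]: the split follows the large-scale column
print(std_labels)  # [0, 0, 0, 1, 1, 1]: the split follows the two groups
```

On this toy data the unscaled run is dominated by the large-scale column, while the standardized run recovers the two groups. The two SSE values are measured in different units, so they should only be compared between runs that use the same preprocessing.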

Author information

Correspondence to Dauda Usman.

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Usman, D., Mohamad, I.B. (2017). A Hybrid K-Means Algorithm Combining Preprocessing-Wise and Centroid Based-Criteria for High Dimension Datasets. In: Ahmad, AR., Kor, L., Ahmad, I., Idrus, Z. (eds) Proceedings of the International Conference on Computing, Mathematics and Statistics (iCMS 2015). Springer, Singapore. https://doi.org/10.1007/978-981-10-2772-7_11

  • DOI: https://doi.org/10.1007/978-981-10-2772-7_11

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-2770-3

  • Online ISBN: 978-981-10-2772-7

  • eBook Packages: Education, Education (R0)
