Hybrid data labeling algorithm for clustering large mixed type data

Sangam, Ravi Sankar; Om, Hari

doi:10.1007/s10844-014-0348-x

Published: 14 December 2014

Journal of Intelligent Information Systems, volume 45, pages 273–293 (2015)


Abstract

Because data are growing rapidly in both volume and variety, clustering a very large database is time-consuming. Sampling is a widely recognized way to speed up clustering: a subset of the data points is drawn as a sample, and a clustering algorithm partitions only that sample into clusters. With this approach, the data points that were not sampled receive no cluster labels. Allocating unlabeled data points to the proper clusters has been well explored for purely numerical and for purely categorical data, but not for data that mix both. In this paper, we propose a hybrid similarity coefficient that measures the resemblance between an unlabeled data point and a cluster, based on the importance of categorical attribute values and the mean values of numerical attributes. Building on this coefficient, we propose a Hybrid Data Labeling Algorithm (HDLA) that assigns an appropriate cluster label to each unlabeled data point. We analyze its time complexity and report experiments on synthetic and real-world datasets that demonstrate the efficacy of HDLA.
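The abstract does not give the formula for the similarity coefficient, so the following is only a minimal sketch of the general idea of labeling unsampled points against cluster summaries. Everything in it is an assumption made for illustration: the function names (`summarize`, `similarity`, `label`), the use of relative value frequency as a stand-in for the "importance" of a categorical value, and the `exp(-|x - mean|)` closeness score for numerical attributes. It should not be read as the HDLA definition from the paper.

```python
from collections import Counter
from math import exp


def summarize(rows, cat_idx, num_idx):
    """Summarize one cluster: value counts per categorical attribute
    and the mean of each numerical attribute."""
    n = len(rows)
    freqs = {j: Counter(row[j] for row in rows) for j in cat_idx}
    means = {j: sum(row[j] for row in rows) / n for j in num_idx}
    return n, freqs, means


def similarity(point, summary, cat_idx, num_idx):
    """Hypothetical hybrid score in [0, 1]; NOT the paper's coefficient."""
    n, freqs, means = summary
    # Categorical part (assumed): relative frequency of the point's value
    # inside the cluster; Counter returns 0 for unseen values.
    cat = sum(freqs[j][point[j]] / n for j in cat_idx)
    # Numerical part (assumed): closeness to the cluster mean, in (0, 1].
    num = sum(exp(-abs(point[j] - means[j])) for j in num_idx)
    return (cat + num) / (len(cat_idx) + len(num_idx))


def label(points, clusters, cat_idx, num_idx):
    """Assign each unlabeled point to the cluster with the highest score."""
    summaries = {
        name: summarize(rows, cat_idx, num_idx)
        for name, rows in clusters.items()
    }
    return [
        max(summaries, key=lambda c: similarity(p, summaries[c], cat_idx, num_idx))
        for p in points
    ]


# Toy data: attribute 0 is categorical, attribute 1 is numerical.
clusters = {
    "A": [("red", 1.0), ("red", 1.2)],
    "B": [("blue", 5.0), ("blue", 5.4)],
}
unlabeled = [("red", 0.9), ("blue", 5.1)]
print(label(unlabeled, clusters, cat_idx=[0], num_idx=[1]))
```

In this sketch each cluster is summarized once, so labeling a point costs time proportional to the number of clusters times the number of attributes. In practice the numerical attributes would likely need to be normalized first so that no single attribute dominates the score.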

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, Jharkhand, 826004, India
Ravi Sankar Sangam & Hari Om

Corresponding author

Correspondence to Ravi Sankar Sangam.

About this article

Cite this article

Sangam, R.S., Om, H. Hybrid data labeling algorithm for clustering large mixed type data. J Intell Inf Syst 45, 273–293 (2015). https://doi.org/10.1007/s10844-014-0348-x

Received: 25 January 2014
Revised: 22 November 2014
Accepted: 25 November 2014
Published: 14 December 2014
Issue Date: October 2015
DOI: https://doi.org/10.1007/s10844-014-0348-x
