Hybrid data labeling algorithm for clustering large mixed type data

Journal of Intelligent Information Systems

Abstract

Because data keep growing in both volume and variety, clustering a very large database is time-consuming. Sampling has been recognized as a useful way to speed up clustering: a subset of the data points is drawn as a sample, and a clustering algorithm partitions only that sample into clusters. The data points that were not sampled therefore receive no cluster labels. Allocating such unlabeled data points to appropriate clusters has been well explored for purely numerical or purely categorical data, but not for data that mix both. In this paper, we propose a hybrid similarity coefficient that measures the resemblance between an unlabeled data point and a cluster, based on the importance of categorical attribute values and the mean values of numerical attributes. Building on this coefficient, we propose a Hybrid Data Labeling Algorithm (HDLA) that assigns an appropriate cluster label to each unlabeled data point. We analyze its time complexity and run experiments on synthetic and real-world datasets to demonstrate the efficacy of HDLA.
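The labeling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the hybrid similarity coefficient itself, so the scoring rule below is an assumption made purely for illustration: the categorical part uses the relative frequency of the point's value inside a cluster as a stand-in for "importance", and the numerical part uses closeness to the cluster mean. The function names (`summarize`, `similarity`, `label`), the `scales` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def summarize(cluster_points, num_idx, cat_idx):
    """Summarize one labeled cluster from the sample.

    Returns the mean of each numerical attribute and the relative
    frequency of each value of each categorical attribute.
    """
    n = len(cluster_points)
    means = {j: sum(p[j] for p in cluster_points) / n for j in num_idx}
    freqs = {
        j: {v: c / n for v, c in Counter(p[j] for p in cluster_points).items()}
        for j in cat_idx
    }
    return means, freqs


def similarity(point, summary, scales):
    """Illustrative hybrid score (an assumption, NOT the paper's coefficient)."""
    means, freqs = summary
    # Categorical part: how common the point's value is inside the cluster.
    cat = sum(f.get(point[j], 0.0) for j, f in freqs.items())
    # Numerical part: closeness to the cluster mean, mapped into (0, 1].
    num = sum(1.0 / (1.0 + abs(point[j] - m) / scales[j]) for j, m in means.items())
    return cat + num


def label(point, summaries, scales):
    """Assign the label of the cluster with the highest similarity score."""
    return max(summaries, key=lambda k: similarity(point, summaries[k], scales))


# Toy clustered sample: attribute 0 is numerical, attribute 1 is categorical.
sample = {
    "A": [(1.0, "red"), (1.2, "red"), (0.8, "blue")],
    "B": [(5.0, "green"), (5.5, "green"), (4.5, "blue")],
}
summaries = {k: summarize(pts, num_idx=[0], cat_idx=[1]) for k, pts in sample.items()}
scales = {0: 1.0}  # hypothetical per-attribute scale for the numerical part

print(label((1.1, "red"), summaries, scales))   # A
print(label((4.9, "blue"), summaries, scales))  # B
```

In this sketch each cluster is summarized once from the sample, and labeling an unsampled point only compares it against those summaries, never against the individual sampled points. That is what makes a labeling pass cheap relative to clustering the full dataset.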


Author information

Corresponding author

Correspondence to Ravi Sankar Sangam.

About this article

Cite this article

Sangam, R.S., Om, H. Hybrid data labeling algorithm for clustering large mixed type data. J Intell Inf Syst 45, 273–293 (2015). https://doi.org/10.1007/s10844-014-0348-x

  • DOI: https://doi.org/10.1007/s10844-014-0348-x
