43:37

An equi-biased k-prototypes algorithm for clustering mixed-type data

  • Ravi Sankar Sangam
  • Hari Om


Clustering is widely recognized as an important approach to data analysis that partitions data according to some (dis)similarity criterion. In recent years, the problem of clustering mixed-type data has attracted many researchers. The k-prototypes algorithm is well known for its scalability in this respect. In this paper, the limitations of the dissimilarity coefficient used in the k-prototypes algorithm are discussed with some illustrative examples. We propose a new hybrid dissimilarity coefficient for the k-prototypes algorithm, which can be applied to data with numerical, categorical and mixed attributes. Our method retains the scalability of the k-prototypes algorithm and, in addition, defines the dissimilarity functions for both attribute types on the same scale with respect to their dimensionality, which benefits the clustering result. The efficacy of our method is shown by experiments on real and synthetic data sets.
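For context, the dissimilarity used by the standard k-prototypes algorithm (Huang's formulation) can be sketched as below: a squared Euclidean term over the numerical attributes plus a weighted count of mismatches over the categorical attributes. This is only an illustration of the baseline measure, not the hybrid coefficient proposed in the paper, which the abstract does not specify. The function name, argument names and sample values are assumptions made for this sketch.

```python
from typing import Sequence


def k_prototypes_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float,
) -> float:
    """Baseline k-prototypes dissimilarity between an object and a prototype.

    Illustrative sketch only; names are hypothetical.
    """
    # Numerical part: squared Euclidean distance to the prototype's means.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    # Categorical part: simple matching, one unit per mismatched attribute.
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    # gamma weights the categorical part against the numerical part.
    return numeric_part + gamma * categorical_part


# One numerical attribute differs by 2 and one categorical attribute mismatches.
print(k_prototypes_dissimilarity([1.0, 2.0], ["a", "b"], [1.0, 4.0], ["a", "c"], 0.5))  # 4.5
```

In this sketch the numerical term is unbounded while the categorical term is at most gamma times the number of categorical attributes, so the two parts do not share a scale unless the data are normalised or gamma is tuned. That appears to be the kind of imbalance the abstract refers to when it speaks of defining both dissimilarity functions on the same scale.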


Data clustering, data mining, k-prototypes, similarity coefficient



Copyright information

© Indian Academy of Sciences 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, India
