Abstract
We propose a new method for clustering categorical data. Clustering algorithms must be designed specifically for categorical data because its nature differs from that of numerical data. Our focus here is on partitional algorithms. One existing approach transforms categorical data into binary data and then applies k-means, but this is computationally inefficient. Another approach is k-modes, which extends k-means by replacing means with modes. In this work, we show that the center-based objective function of k-modes cannot produce accurate clustering results. Instead, we propose an objective function that generalizes the k-means objective but is not based on centers. We show that it is more effective than the center-based objective and demonstrate this on real-life datasets. We also find that the proposed objective function can be optimized efficiently with a particular algorithm called the transfer algorithm. Thus our method is both efficient and effective.
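The idea of a center-free objective optimized by transfers can be illustrated with a minimal sketch. This is not the paper's formulation, which the abstract does not state. It assumes the Hamming (simple matching) dissimilarity and uses the known identity that the k-means within-cluster sum of squares equals the sum of pairwise squared distances in a cluster divided by the cluster size. The names `hamming`, `pairwise_objective`, and `transfer_pass` are hypothetical, and the objective is recomputed naively after every trial move, so the sketch shows the logic only and makes no claim about efficiency.

```python
from itertools import combinations


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_objective(data, labels, k):
    """Center-free objective (assumed form): for each cluster, the sum of
    pairwise dissimilarities divided by the cluster size, summed over clusters."""
    total = 0.0
    for c in range(k):
        members = [row for row, lab in zip(data, labels) if lab == c]
        if len(members) > 1:
            pair_sum = sum(hamming(a, b) for a, b in combinations(members, 2))
            total += pair_sum / len(members)
    return total


def transfer_pass(data, labels, k):
    """Transfer-style local search: move one object at a time to the cluster
    that lowers the objective most, until no single move helps."""
    labels = list(labels)
    improved = True
    while improved:
        improved = False
        for i in range(len(data)):
            current = labels[i]
            if labels.count(current) == 1:
                continue  # never empty a cluster
            best_label = current
            best_value = pairwise_objective(data, labels, k)
            for c in range(k):
                if c == current:
                    continue
                labels[i] = c  # trial move
                value = pairwise_objective(data, labels, k)
                if value < best_value:
                    best_label, best_value = c, value
            labels[i] = best_label  # keep the best move, or revert
            if best_label != current:
                improved = True
    return labels


if __name__ == "__main__":
    # Toy records with two categorical attributes (made up for illustration).
    data = [("a", "x"), ("a", "x"), ("a", "y"),
            ("b", "z"), ("b", "z"), ("b", "w")]
    start = [0, 1, 0, 1, 0, 1]
    final = transfer_pass(data, start, k=2)
    print(final, pairwise_objective(data, final, k=2))
```

Because every accepted move strictly lowers the objective and the number of partitions is finite, the loop terminates at a local optimum. A practical implementation would update the objective incrementally per transfer instead of recomputing it from scratch.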
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xiang, Z., Ji, L. (2013). The Use of Transfer Algorithm for Clustering Categorical Data. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds) Advanced Data Mining and Applications. ADMA 2013. Lecture Notes in Computer Science, vol 8347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53917-6_6
DOI: https://doi.org/10.1007/978-3-642-53917-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53916-9
Online ISBN: 978-3-642-53917-6
eBook Packages: Computer Science (R0)