Categorical Data Clustering Using the Combinations of Attribute Values

Do, Hee-Jung; Kim, Jae-Yearn

doi:10.1007/978-3-540-69848-7_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5073)

Included in the following conference series:

International Conference on Computational Science and Its Applications

Abstract

Clustering is an important technique for exploratory data analysis. While most earlier clustering algorithms focused on numerical data, real-world problems and data mining applications frequently involve categorical data. Here, we propose a new clustering algorithm for categorical data that is based on the frequency of attribute value combinations. Our algorithm finds all the combinations of attribute values in a record, each of which is a subset of that record's attribute values, and then groups the records using the frequency of these combinations. Because the algorithm considers every subset of the attribute values in a record, the records in a cluster have not only similar attribute value sets but also strongly associated attribute values. We evaluated our algorithm with real and synthetic data sets, and the experimental results demonstrate its effectiveness.
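
The abstract gives only the outline of the method, so the following Python sketch illustrates just the first step it names: enumerating every combination of attribute values in a record and counting how often each combination occurs across the data set. The function names, the toy records, and the choice to represent a combination as a sorted tuple of (attribute, value) pairs are assumptions made for illustration; the rule the paper uses to group records from these frequencies is not stated in the abstract and is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def value_combinations(record):
    """Yield every non-empty subset of a record's (attribute, value) pairs.

    Pairs are sorted first so that the same combination always produces
    the same tuple, whatever order the attributes were stored in.
    """
    items = sorted(record.items())
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def combination_frequencies(records):
    """Count how many records contain each attribute value combination."""
    counts = Counter()
    for record in records:
        counts.update(value_combinations(record))
    return counts


# Hypothetical toy data, not taken from the paper.
records = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "large"},
    {"color": "blue", "shape": "square", "size": "large"},
]

freq = combination_frequencies(records)

# Combinations shared by the first two records appear twice.
print(freq[(("color", "red"), ("shape", "round"))])
```

A record with n attributes yields 2^n - 1 non-empty combinations, so this exhaustive enumeration is practical only for records with a modest number of attributes.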

Author information

Authors and Affiliations

Department of Industrial Engineering, Hanyang University, Sungdong-gu, Seoul, 133-791, Korea
Hee-Jung Do & Jae-Yearn Kim

Editor information

Osvaldo Gervasi, Beniamino Murgante, Antonio Laganà, David Taniar, Youngsong Mun, Marina L. Gavrilova

About this paper

Cite this paper

Do, HJ., Kim, JY. (2008). Categorical Data Clustering Using the Combinations of Attribute Values. In: Gervasi, O., Murgante, B., Laganà, A., Taniar, D., Mun, Y., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2008. ICCSA 2008. Lecture Notes in Computer Science, vol 5073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69848-7_19

DOI: https://doi.org/10.1007/978-3-540-69848-7_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69840-1
Online ISBN: 978-3-540-69848-7
eBook Packages: Computer Science, Computer Science (R0)