
Clustering of Mixed-Type Data Considering Concept Hierarchies

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11439)

Abstract

Most clustering algorithms have been designed for purely numerical or purely categorical data sets, while nowadays many applications generate mixed data. This raises the question of how to integrate various types of attributes so that objects can be grouped efficiently without loss of information. It is already well understood that simply converting categorical attributes into a numerical domain is not sufficient, since it artificially introduces relationships between values, such as a certain order. Leveraging the natural conceptual hierarchy among categorical information, concept trees summarize the categorical attributes. In this paper we propose the algorithm ClicoT (CLustering mixed-type data Including COncept Trees), which is based on the Minimum Description Length (MDL) principle. Profiting from the conceptual hierarchies, ClicoT integrates categorical and numerical attributes by means of an MDL-based objective function. The result of ClicoT is well interpretable, since concept trees provide insights into the categorical data. Extensive experiments on synthetic and real data sets illustrate that ClicoT is noise-robust and yields well-interpretable results in a short runtime.


Notes

  1. https://bit.ly/2FkUB3Q.

  2. http://openflights.org/data.html.


Author information

Correspondence to Sahar Behzadi.

Appendices

Appendix

A Probability Adjustment

To adjust the probabilities for a numerical cluster-specific attribute we can safely use the mean and variance corresponding to the cluster. In contrast, learning the cluster-specific concept hierarchy is more challenging, since we need to maintain the integrity of the hierarchy: the node probabilities of siblings at each level must sum to the probability of their parent node, and the node probabilities at each level must sum to one. ProcessHierarchy() in Algorithm 2 is a recursive function that updates the concept tree assuming marked cluster-specific elements. Within this function, Propagatedown() preserves the concept tree properties by propagating the parents' probabilities down to their children.

[Algorithm 2 pseudocode: figures b and c]
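Assuming a simple tree representation, the propagation step performed by Propagatedown() can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `Node` class and the uniform fallback for all-zero children are assumptions.

```python
class Node:
    """A concept-tree node holding a probability and its children."""

    def __init__(self, name, prob, children=None):
        self.name = name
        self.prob = prob
        self.children = children or []


def propagate_down(node):
    """Rescale the children of `node` so that their probabilities sum
    to node.prob, then recurse, preserving the hierarchy's integrity."""
    if not node.children:
        return
    total = sum(c.prob for c in node.children)
    for c in node.children:
        if total > 0:
            # Keep the children's relative proportions.
            c.prob = node.prob * (c.prob / total)
        else:
            # Fall back to a uniform split if all children are zero.
            c.prob = node.prob / len(node.children)
        propagate_down(c)
```

Because every parent distributes exactly its own probability to its children, the sibling-sum constraint holds at each level after one top-down pass.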

B MPG

MPG is a slightly modified version of a data set provided in the StatLib library. The data concerns city-cycle fuel consumption in miles per gallon (MPG), described by 3 categorical and 5 numerical attributes covering different characteristics of 397 cars. We consider MPG, ranging from 10 to 46.6, as the ground truth and divide this range into 7 intervals of equal length. Using a concept hierarchy for the car names, we group all cars into three branches: European, American and Japanese cars. Moreover, we divide the range of the model-year attribute into three intervals: 70–74, 75–80, and after 80. We leave the third attribute as a flat concept hierarchy, since there is no meaningful hierarchy among the values of the cylinders attribute.
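The equal-width ground-truth binning described above can be sketched as follows; `mpg_label` is a hypothetical helper, not the authors' code, with defaults mirroring the stated range 10–46.6 and 7 intervals.

```python
def mpg_label(mpg, lo=10.0, hi=46.6, k=7):
    """Return the 0-based index of the equal-width interval that
    `mpg` falls into, given the range [lo, hi] split into k bins."""
    width = (hi - lo) / k
    # Clamp so the maximum value hi falls into the last interval
    # rather than spilling over into a non-existent bin k.
    return min(int((mpg - lo) // width), k - 1)
```

For example, `mpg_label(10.0)` yields the first interval and `mpg_label(46.6)` the seventh, so every car receives one of 7 ground-truth labels.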

C Adult Dataset

The Adult data set, extracted from the census bureau database, consists of 48,842 instances with 11 attributes after excluding the attributes with missing values (six numerical and five categorical). The class attribute Salary indicates whether the salary is over 50K or not. The categorical attributes capture different information, e.g. work-class, education, occupation and so on. Figure 7 shows the concept hierarchies for three selected categorical attributes: work-class, relationship and education.
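One such hierarchy could be encoded as a nested mapping, as in the illustrative sketch below. The leaf values are the actual work-class categories of the Adult data set, but the grouping into broader concepts is a plausible assumption and may differ from the hierarchy used in Fig. 7.

```python
# Hypothetical concept hierarchy for the work-class attribute:
# leaf categories grouped under broader intermediate concepts.
workclass_hierarchy = {
    "work-class": {
        "government": ["Federal-gov", "State-gov", "Local-gov"],
        "private": ["Private"],
        "self-employed": ["Self-emp-inc", "Self-emp-not-inc"],
        "unpaid": ["Without-pay", "Never-worked"],
    }
}


def leaves(tree):
    """Collect all leaf categories of a two-level nested hierarchy."""
    out = []
    for groups in tree.values():
        for members in groups.values():
            out.extend(members)
    return out
```

A clustering algorithm can then describe a cluster at the intermediate level (e.g. "government") instead of enumerating every raw category, which is the kind of interpretability the concept trees provide.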

Fig. 7. Concept trees for three categorical attributes of the Adult data set.

D Open Flights Dataset

The clustering results of the various algorithms are provided here at a higher resolution (Figs. 8, 9, 10, 11 and 12).

Fig. 8. ClicoT.

Fig. 9. KMM.

Fig. 10. MDBSCAN.

Fig. 11. INCONCO and INTEGRATE.

Fig. 12. DH.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Behzadi, S., Müller, N.S., Plant, C., Böhm, C. (2019). Clustering of Mixed-Type Data Considering Concept Hierarchies. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science(), vol 11439. Springer, Cham. https://doi.org/10.1007/978-3-030-16148-4_43

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-16148-4_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-16147-7

  • Online ISBN: 978-3-030-16148-4

  • eBook Packages: Computer Science (R0)
