Abstract
Cluster analysis groups the data items of a given set into subsets according to similarity properties. Clustering techniques have been applied in many fields, which typically involve large amounts of complex data. This study focuses on what we call multi-domain clustering and labeling, i.e., a set of techniques for clustering multi-dimensional, structured, mixed data. The work studies the best mix of clustering techniques for addressing the problem in the multi-domain setting. The data types considered are numerical, categorical, and textual, all of which can appear together within the same clustering scenario. We focus on k-means and agglomerative hierarchical clustering methods based on a new distance function that we define for this specific setting. The proposed approach has been validated on real and realistic data sets from the college, automobile, and leisure domains. The experimental results allowed us to evaluate the effectiveness of the different solutions for both clustering and labeling.
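To make the idea of a distance over mixed attributes concrete, the following is a minimal illustrative sketch. It is not the distance function defined in the chapter, which the abstract does not specify. It assumes a simple Gower-style average of per-attribute distances: range-normalized absolute difference for numerical values, 0/1 mismatch for categorical values, and Jaccard distance over lowercase word sets for textual values. The name `mixed_distance`, the `kinds` and `ranges` parameters, and the sample records are all hypothetical.

```python
from typing import Dict, Sequence, Union

Value = Union[float, str]


def mixed_distance(
    a: Sequence[Value],
    b: Sequence[Value],
    kinds: Sequence[str],
    ranges: Dict[int, float],
) -> float:
    """Average of per-attribute distances, each scaled to [0, 1].

    kinds[i] is "num", "cat", or "text" (an assumed encoding).
    ranges[i] is the observed value range of numerical attribute i.
    """
    total = 0.0
    for i, kind in enumerate(kinds):
        if kind == "num":
            # Range-normalized absolute difference.
            span = ranges[i]
            total += (abs(a[i] - b[i]) / span) if span else 0.0
        elif kind == "cat":
            # Simple matching: 0 if equal, 1 otherwise.
            total += 0.0 if a[i] == b[i] else 1.0
        elif kind == "text":
            # Jaccard distance between lowercase word sets.
            words_a = set(a[i].lower().split())
            words_b = set(b[i].lower().split())
            union = words_a | words_b
            total += (1.0 - len(words_a & words_b) / len(union)) if union else 0.0
        else:
            raise ValueError(f"unknown attribute kind: {kind!r}")
    return total / len(kinds)


# Hypothetical records: (price, fuel type, free-text description).
kinds = ("num", "cat", "text")
ranges = {0: 40000.0}
car_a = (20000.0, "diesel", "compact city car")
car_b = (30000.0, "diesel", "compact family car")

d = mixed_distance(car_a, car_b, kinds, ranges)
print(d)
```

A pairwise matrix of such distances could then be passed to an agglomerative hierarchical method. A k-means-style method would additionally need a definition of a cluster representative for categorical and textual attributes.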
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Brambilla, M., Zanoni, M. (2012). Clustering and Labeling of Multi-dimensional Mixed Structured Data. In: Ceri, S., Brambilla, M. (eds) Search Computing. Lecture Notes in Computer Science, vol 7538. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34213-4_8
DOI: https://doi.org/10.1007/978-3-642-34213-4_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34212-7
Online ISBN: 978-3-642-34213-4
eBook Packages: Computer Science, Computer Science (R0)