Abstract
Similarity between complex data objects is one of the central notions in data mining. We propose similarity (or distance) measures between various components of a 0/1 relation: between attributes, between rows, and between subrelations of the database. These measures find important applications in clustering, classification, and several other data mining processes. Our measures are based on the contexts of individual components. For example, two products (i.e., attributes) are deemed similar if their respective sets of customers (i.e., subrelations) are similar. This reveals more subtle relationships between components, something that is usually missing in simpler measures. The problem of finding such distance measures can be formulated as a system of nonlinear equations. We present an iterative algorithm which, when seeded with random initial values, converges quickly to stable distances in practice (typically requiring fewer than five iterations). The algorithm requires only one database scan. Results on artificial and real data show that our method is efficient and produces results with intuitive appeal.
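The abstract does not state the update equations, so the following is only a minimal sketch of the general idea of mutually defined distances, not the authors' algorithm. It assumes one plausible rule: the distance between two rows is the average distance between their attributes, and the distance between two attributes is the average distance between the rows that contain them. The function name `context_distances`, its parameters, and the rescaling step are all hypothetical choices made for illustration.

```python
import numpy as np

def context_distances(M, max_iter=50, tol=1e-9, seed=0):
    """Illustrative fixed-point iteration on a 0/1 matrix M (rows x attributes).

    Assumed rule (not taken from the paper): row distances are averages of
    attribute distances, and attribute distances are averages of row distances.
    """
    M = np.asarray(M, dtype=float)
    n_rows, n_attrs = M.shape

    # Each row as a uniform distribution over its attributes,
    # each attribute as a uniform distribution over its rows.
    # np.maximum guards against division by zero for empty rows/columns.
    P = M / np.maximum(M.sum(axis=1, keepdims=True), 1.0)      # rows x attrs
    Q = M.T / np.maximum(M.T.sum(axis=1, keepdims=True), 1.0)  # attrs x rows

    # Random symmetric starting distances with a zero diagonal.
    rng = np.random.default_rng(seed)
    A = rng.random((n_attrs, n_attrs))
    d_attr = (A + A.T) / 2.0
    np.fill_diagonal(d_attr, 0.0)

    for it in range(1, max_iter + 1):
        # Row distances from the current attribute distances.
        d_row = P @ d_attr @ P.T
        np.fill_diagonal(d_row, 0.0)

        # Attribute distances from the row distances just computed.
        new_attr = Q @ d_row @ Q.T
        np.fill_diagonal(new_attr, 0.0)

        # Rescale so the largest distance is 1 (keeps the iteration bounded).
        peak = new_attr.max()
        if peak > 0:
            new_attr /= peak

        change = np.abs(new_attr - d_attr).max()
        d_attr = new_attr
        if change < tol:
            break

    return d_attr, d_row, it


# Two groups of customers buying disjoint pairs of products.
M = np.array([[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3)
d_attr, d_row, iters = context_distances(M)
```

On this toy relation, products 0 and 1 share the same customers, so under the assumed rule their distance shrinks toward zero while distances across the two product groups stay large. The number of iterations needed by this sketch says nothing about the convergence behaviour reported for the actual method.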
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Das, G., Mannila, H. (2000). Context-Based Similarity Measures for Categorical Databases. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2000. Lecture Notes in Computer Science, vol 1910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45372-5_20
DOI: https://doi.org/10.1007/3-540-45372-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41066-9
Online ISBN: 978-3-540-45372-7
eBook Packages: Springer Book Archive