Context-Based Similarity Measures for Categorical Databases

Das, Gautam; Mannila, Heikki

doi:10.1007/3-540-45372-5_20

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1910)

Abstract

Similarity between complex data objects is one of the central notions in data mining. We propose similarity (or distance) measures between various components of a 0/1 relation: between attributes, between rows, and between subrelations of the database. These measures have important applications in clustering, classification, and several other data mining processes. Our measures are based on the contexts of individual components. For example, two products (i.e., attributes) are deemed similar if their respective sets of customers (i.e., subrelations) are similar. This approach reveals more subtle relationships between components, which simpler measures usually miss. The problem of finding such distance measures can be formulated as a system of nonlinear equations. We present an iterative algorithm which, when seeded with random initial values, converges quickly to stable distances in practice (typically requiring fewer than five iterations). The algorithm requires only one database scan. Results on artificial and real data show that our method is efficient and produces results with intuitive appeal.
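The circular definition in the abstract (attributes are similar when their subrelations are similar, and rows are similar when their attributes are similar) can be illustrated with a small fixed-point iteration. The sketch below is not the authors' algorithm, whose equations are not given on this page. It swaps in a SimRank-style averaging recursion, which is essentially linear rather than nonlinear, and it makes no attempt to reproduce the single-scan or five-iteration behaviour reported above. The names `context_similarity`, `jaccard`, and `decay`, and the toy baskets, are assumptions made for illustration.

```python
import random


def context_similarity(rows, n_attrs, decay=0.8, tol=1e-10, max_iter=500, seed=0):
    """Mutually recursive attribute/row similarity on a 0/1 relation.

    rows: list of sets of attribute indices (the 1-entries of each row).
    Assumes every row is non-empty and every attribute occurs in some row.
    Illustrative SimRank-style recursion, not the measure from the paper.
    """
    rng = random.Random(seed)
    n_rows = len(rows)
    members = [sorted(r) for r in rows]
    # Subrelation of attribute a: indices of the rows that contain a.
    owners = [[i for i, r in enumerate(rows) if a in r] for a in range(n_attrs)]

    # Random symmetric start in [0, 1] with a unit diagonal.
    attr = [[1.0] * n_attrs for _ in range(n_attrs)]
    for a in range(n_attrs):
        for b in range(a + 1, n_attrs):
            attr[a][b] = attr[b][a] = rng.random()

    def mean_pairs(sim, xs, ys):
        return sum(sim[x][y] for x in xs for y in ys) / (len(xs) * len(ys))

    for iteration in range(1, max_iter + 1):
        # Rows are similar when the attributes they contain are similar.
        row = [[1.0 if i == j else decay * mean_pairs(attr, members[i], members[j])
                for j in range(n_rows)] for i in range(n_rows)]
        # Attributes are similar when their subrelations (row sets) are similar.
        new = [[1.0 if a == b else decay * mean_pairs(row, owners[a], owners[b])
                for b in range(n_attrs)] for a in range(n_attrs)]
        change = max(abs(new[a][b] - attr[a][b])
                     for a in range(n_attrs) for b in range(n_attrs))
        attr = new
        if change < tol:  # similarities have stabilised
            break
    return attr, iteration


def jaccard(rows, a, b):
    """Simple co-occurrence baseline: overlap of the two attributes' row sets."""
    ra = {i for i, r in enumerate(rows) if a in r}
    rb = {i for i, r in enumerate(rows) if b in r}
    return len(ra & rb) / len(ra | rb)


COKE, PEPSI, CHIPS, BEER, DIAPERS = range(5)
baskets = [{COKE, CHIPS}, {COKE, CHIPS}, {PEPSI, CHIPS}, {PEPSI, CHIPS},
           {BEER, DIAPERS}, {BEER, DIAPERS}]

sim, n_iter = context_similarity(baskets, n_attrs=5)
print("Jaccard(coke, pepsi):", jaccard(baskets, COKE, PEPSI))
print("context sim(coke, pepsi):", round(sim[COKE][PEPSI], 3))
print("context sim(coke, beer):", round(sim[COKE][BEER], 3))
```

In this toy relation no customer buys both colas, so the co-occurrence baseline scores them as unrelated, while the context-based score is clearly positive because both sets of customers also buy chips. Because the diagonal is pinned to 1 and `decay` is below 1, each pass shrinks the gap between any two candidate solutions, so different random seeds settle on the same values.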

Author information

Authors and Affiliations

Microsoft Research, UK
Gautam Das
Nokia Research, UK
Heikki Mannila

Editor information

Editors and Affiliations

Department of Computer and Information Science, Norwegian University of Science and Technology, O.S. Bragstads plass 2E, 7491, Trondheim, Norway
Jan Komorowski
Department of Computer Science, University of North Carolina, Charlotte, NC 28223, USA
Jan Żytkow
Laboratoire ERIC, Université Lyon 2, 5 avenue Pierre Mendès-France, 69676, Bron, France
Djamel A. Zighed

About this paper

Cite this paper

Das, G., Mannila, H. (2000). Context-Based Similarity Measures for Categorical Databases. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2000. Lecture Notes in Computer Science, vol 1910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45372-5_20

DOI: https://doi.org/10.1007/3-540-45372-5_20
Published: 18 July 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41066-9
Online ISBN: 978-3-540-45372-7
eBook Packages: Springer Book Archive
