Abstract
Many knowledge discovery and data mining methods, such as nearest neighbor classification and clustering, are distance-based, so similarity measures between objects play an essential role. Real-world databases are often heterogeneous, with mixed numeric and symbolic attributes, yet most available similarity measures apply only to symbolic data or only to numeric data. In such cases, data mining methods often have to transform numeric data into symbolic data by discretization. Mixed similarity measures (MSMs), which require no discretization of numeric values, are a desirable alternative for objects with mixed symbolic and numeric data. However, the time and space complexities of computing the available MSMs are often so high that they cannot be applied to large datasets. For Goodall’s MSM, which was inspired by biological taxonomy, computing methods have been developed, but their time and space complexities so far are at least O(n² log n²) and O(n²), respectively. In this work, we propose a new and efficient method for computing this MSM with O(n log n) time and O(n) space complexity. We demonstrate experimentally the applicability of the new method to large datasets and suggest meta-knowledge on the use of this MSM. In practice, the experimental results show that only the near-linear time and space MSM could be applied to mining large heterogeneous datasets.
Currently with the Faculty of Information Technology, Hanoi University of Technology. Dai Co Viet, Hai Ba Trung, Hanoi, VIETNAM.
References
Goodall, D.W.: A New Similarity Index Based On Probability. Biometrics, Vol. 22 (1966) 882–907.
Gowda, K.C., Diday, E.: Symbolic Clustering Using a New Similarity Measure. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 22, No. 2 (1992) 368–378.
Ho, T.B., Nguyen, N.B., Morita, T.: Study of a Mixed Similarity Measure for Classification and Clustering. 3rd Pacific-Asia Conf. on Knowledge Discovery and Data Mining PAKDD’99. Lecture Notes in Artificial Intelligence 1574. Springer-Verlag (1999) 375–379.
Huang, Z.: Clustering Large Data Sets With Mixed Numeric and Categorical Values. KDD: Techniques and Application. World Scientific (1997) 21–34.
Ichino, M., Yaguchi, H.: Generalized Minkowski Metrics for Mixed Feature-Type Data Analysis. IEEE Trans. Systems, Man and Cybernetics, Vol. 24 (1994) 679–709.
Lancaster, H.O.: The Combining of Probabilities Arising from Data in Discrete Distributions. Biometrika. Vol. 36 (1949) 370–382.
Li, C., Biswas, G.: Unsupervised Clustering with Mixed Numeric and Nominal Data - A New Similarity Based Agglomerative System. KDD: Techniques and Application. World Scientific (1997) 33–48.
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993).
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Binh Nguyen, N., Bao Ho, T. (2000). A Mixed Similarity Measure in Near-Linear Computational Complexity for Distance-Based Methods. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2000. Lecture Notes in Computer Science, vol 1910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45372-5_21
DOI: https://doi.org/10.1007/3-540-45372-5_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41066-9
Online ISBN: 978-3-540-45372-7
eBook Packages: Springer Book Archive