A Mixed Similarity Measure in Near-Linear Computational Complexity for Distance-Based Methods

Ngoc Binh Nguyen; Tu Bao Ho

doi:10.1007/3-540-45372-5_21


Conference paper
First Online: 01 January 2002


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1910)

Abstract

Many methods of knowledge discovery and data mining, such as nearest neighbor classification and clustering, are distance-based, and similarity measures between objects play an essential role in them. Although real-world databases are often heterogeneous, with mixed numeric and symbolic attributes, most available similarity measures can be applied only to either symbolic or numeric data. In such cases, data mining methods often require transforming numeric data into symbolic data by discretization techniques. Mixed similarity measures (MSMs) that do not discretize numeric values are desirable alternatives for objects with mixed symbolic and numeric data. However, the time and space complexities of computing the available MSMs are often so high that MSMs are not applicable to large datasets. In the framework of Goodall's MSM, which was inspired by biological taxonomy, computing methods have been proposed, but their time and space complexities so far are at least O(n² log n²) and O(n²), respectively. In this work, we propose a new and efficient method for computing this MSM with O(n log n) time and O(n) space complexities. We demonstrate experimentally the applicability of the new method to large datasets and suggest meta-knowledge on the use of this MSM. In practice, the experimental results show that only the near-linear time and space MSM is applicable to mining large heterogeneous datasets.
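
To make the quadratic cost mentioned above concrete, the following is a minimal sketch of a naive baseline, not Goodall's measure itself and not the method proposed in the paper. It assumes, for illustration only, that a probability-based similarity on one numeric attribute ranks the difference of a pair against the differences of all other pairs. The function name `pairwise_difference_ranks` and the sample data are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations


def pairwise_difference_ranks(values):
    """Build a lookup for the fraction of value pairs whose
    absolute difference is at most a given threshold.

    Illustrative baseline only: it materialises every pairwise
    difference, which is what makes it quadratic in n.
    """
    # n*(n-1)/2 differences are stored: O(n^2) space.
    # Sorting them costs O(n^2 log n^2) time.
    diffs = sorted(abs(a - b) for a, b in combinations(values, 2))

    def fraction_at_most(d):
        # Binary search over the sorted differences.
        return bisect_right(diffs, d) / len(diffs)

    return fraction_at_most


# Hypothetical sample values for one numeric attribute.
values = [1.0, 2.0, 4.0, 8.0]
frac = pairwise_difference_ranks(values)
print(frac(3.0))  # fraction of pairs that differ by at most 3.0
```

With only four values the list of differences has six entries, but for n objects it grows as n(n-1)/2, which is why an approach of this kind does not scale to large datasets.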

Currently with the Faculty of Information Technology, Hanoi University of Technology. Dai Co Viet, Hai Ba Trung, Hanoi, VIETNAM.


References

Goodall, D.W.: A New Similarity Index Based On Probability. Biometrics, Vol. 22 (1966) 882–907.
Gowda, K.C., Diday, E.: Symbolic Clustering Using a New Similarity Measure. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 22, No. 2 (1992) 368–378.
Ho, T.B., Nguyen, N.B., Morita, T.: Study of a Mixed Similarity Measure for Classification and Clustering. 3rd Pacific-Asia Conf. on Knowledge Discovery and Data Mining PAKDD'99. Lecture Notes in Artificial Intelligence 1574. Springer-Verlag (1999) 375–379.
Huang, Z.: Clustering Large Data Sets With Mixed Numeric and Categorical Values. KDD: Techniques and Application. World Scientific (1997) 21–34.
Ichino, M., Yaguchi, H.: Generalized Minkowski Metrics for Mixed Feature-Type Data Analysis. IEEE Trans. Systems, Man and Cybernetics, Vol. 24 (1994) 679–709.
Lancaster, H.O.: The Combining of Probabilities Arising from Data in Discrete Distributions. Biometrika. Vol. 36 (1949) 370–382.
Li, C., Biswas, G.: Unsupervised Clustering with Mixed Numeric and Nominal Data - A New Similarity Based Agglomerative System. KDD: Techniques and Application. World Scientific (1997) 33–48.
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993).


Author information

Authors and Affiliations

Graduate School of Knowledge Science, Japan Advanced Institute of Science and Technology, 923-1292, Ishikawa, Tatsunokuchi, JAPAN
Ngoc Binh Nguyen & Tu Bao Ho


Editor information

Editors and Affiliations

Department of Computer and Information Science, Norwegian University of Science and Technology, O.S. Bragstads plass 2E, 7491, Trondheim, Norway
Jan Komorowski
Department of Computer Science, University of North Carolina, Charlotte, NC 28223, USA
Jan Żytkow
Laboratoire ERIC, Université Lyon 2, 5 avenue Pierre Mendès-France, 69676, Bron, France
Djamel A. Zighed

About this paper

Cite this paper

Nguyen, N.B., Ho, T.B. (2000). A Mixed Similarity Measure in Near-Linear Computational Complexity for Distance-Based Methods. In: Zighed, D.A., Komorowski, J., Żytkow, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2000. Lecture Notes in Computer Science, vol 1910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45372-5_21


DOI: https://doi.org/10.1007/3-540-45372-5_21
Published: 18 July 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41066-9
Online ISBN: 978-3-540-45372-7
eBook Packages: Springer Book Archive
