Using Clustering to Learn Distance Functions for Supervised Similarity Assessment

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3587)

Abstract

Assessing the similarity between objects is a prerequisite for many data mining techniques. This paper introduces a novel approach to learning distance functions that maximizes the clustering of objects belonging to the same class. Objects in a data set are clustered with respect to a given distance function, and the local class density information of each cluster is then used by a weight-adjustment heuristic to modify the distance function so that class density in the attribute space increases. This process of interleaving clustering with distance function modification is repeated until a "good" distance function has been found. We implemented our approach using the k-means clustering algorithm. We evaluated it on 7 UCI data sets with a traditional 1-nearest-neighbor (1-NN) classifier and with a compressed 1-NN classifier, called NCC, that uses the learned distance function and cluster centroids instead of all the points of the training set. The experimental results show that attribute weighting leads to statistically significant improvements in prediction accuracy over a traditional 1-NN classifier for 2 of the 7 data sets tested, whereas using NCC significantly improves the accuracy of the 1-NN classifier for 4 of the 7 data sets.
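
The abstract describes the interleaved loop (cluster, measure per-cluster class density, re-weight attributes, repeat) but not the exact weight-adjustment heuristic. The sketch below is therefore only a minimal illustration of that loop, not the authors' algorithm: the function name learn_attribute_weights, the purity-based multiplicative update, and the learning rate eta are assumptions, and scikit-learn's KMeans stands in for the paper's k-means implementation. Scaling each attribute by the square root of its weight makes standard k-means behave like k-means under a weighted Euclidean distance.

```python
# Minimal sketch of interleaving k-means with attribute-weight adjustment.
# NOTE: the weight-update heuristic below is hypothetical; the paper's own
# heuristic (driven by local class density) is not specified in the abstract.
import numpy as np
from sklearn.cluster import KMeans


def learn_attribute_weights(X, y, k=10, iterations=20, eta=0.1, seed=0):
    """Return one weight per attribute for a weighted Euclidean distance."""
    d = X.shape[1]
    w = np.ones(d)
    for _ in range(iterations):
        # Cluster under the current distance: scaling by sqrt(w) makes plain
        # k-means equivalent to k-means with the weighted distance
        # dist(x, z) = sum_i w[i] * (x[i] - z[i])**2.
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X * np.sqrt(w))
        update = np.zeros(d)
        for c in range(k):
            members = labels == c
            if members.sum() < 2:
                continue
            classes, counts = np.unique(y[members], return_counts=True)
            purity = counts.max() / counts.sum()      # local class density
            majority = classes[counts.argmax()]
            # Heuristic: reward attributes along which the cluster's majority
            # class is tight relative to the cluster as a whole.
            spread_all = X[members].var(axis=0) + 1e-12
            spread_maj = X[members & (y == majority)].var(axis=0)
            update += purity * (1.0 - spread_maj / spread_all)
        w *= np.exp(eta * update / k)   # multiplicative re-weighting
        w *= d / w.sum()                # keep the weights on a fixed scale
    return w
```

A compressed classifier in the spirit of NCC would then keep only the k cluster centroids, label each with its cluster's majority class, and classify a query by the nearest centroid under the learned weighted distance.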

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Eick, C.F., Rouhana, A., Bagherjeiran, A., Vilalta, R. (2005). Using Clustering to Learn Distance Functions for Supervised Similarity Assessment. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science (LNAI), vol. 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_13

  • DOI: https://doi.org/10.1007/11510888_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26923-6

  • Online ISBN: 978-3-540-31891-0
