We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, the Manhattan distance on vector spaces, and the Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with f-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct an empirical investigation to demonstrate the effectiveness of the new metrics on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski (\(L^p\)) family, the fractional \(L^p\) family, two f-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.
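The abstract does not give the unified metric's formula, so the sketch below only illustrates the classical distances it names as special cases: the Jaccard distance on finite sets, the Manhattan (\(L^1\)) distance on vectors, and a discrete analogue of the Marczewski-Steinhaus distance on nonnegative vectors. Function names are ours, for illustration only.

```python
# Illustrative sketch of the classical metrics the paper unifies;
# this is NOT the paper's proposed metric family.

def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance on finite sets: 1 - |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0  # convention: two empty sets are at distance 0
    return 1.0 - len(a & b) / len(a | b)

def manhattan_distance(x, y) -> float:
    """Manhattan (L^1) distance between real vectors of equal length."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def marczewski_steinhaus(x, y) -> float:
    """Discrete analogue of the Marczewski-Steinhaus distance for
    nonnegative vectors: sum |x_i - y_i| / sum max(x_i, y_i).
    On indicator vectors this reduces to the Jaccard distance."""
    denom = sum(max(xi, yi) for xi, yi in zip(x, y))
    if denom == 0:
        return 0.0
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / denom
```

For example, on the indicator vectors of the sets {1, 2} and {2, 3}, `marczewski_steinhaus` and `jaccard_distance` agree, both returning 2/3.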
Keywords: Distance · Metric · Ontology · Machine learning · Text mining · High-dimensional data · Computational biology
We thank Prof. Jovana Kovačević from the University of Belgrade for helpful discussions. We also thank the Action Editor and three anonymous reviewers for their insightful comments that have contributed to improved precision and quality of the paper.
This work was partially supported by the National Science Foundation (NSF) Grant DBI-1458477 (PR), the NSF Grant DMS-1206405 (EAH), and the Precision Health Initiative of Indiana University.