A new class of metrics for learning on real-valued and structured data

  • Ruiyu Yang
  • Yuxiang Jiang
  • Scott Mathews
  • Elizabeth A. Housworth
  • Matthew W. Hahn
  • Predrag Radivojac

Abstract

We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, the Manhattan distance on vector spaces, and the Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with f-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct an empirical investigation to demonstrate their effectiveness on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski (\(L^p\)) family, the fractional \(L^p\) family, two f-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.
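For context, the classical metrics that the abstract says are unified by the new family can be sketched in a few lines. The sketch below implements only these standard distances (Jaccard on sets, Manhattan on vectors, and the Marczewski-Steinhaus distance on nonnegative vectors); the paper's new family itself is defined in the full text and is not reproduced here.

```python
def jaccard_distance(a, b):
    """Jaccard distance on finite sets: |A ^ B| / |A | B|
    (symmetric difference over union; defined as 0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)

def manhattan_distance(x, y):
    """L^1 (Manhattan) distance on vectors of equal length."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def marczewski_steinhaus(x, y):
    """Marczewski-Steinhaus distance on nonnegative vectors:
    sum_i |x_i - y_i| / sum_i max(x_i, y_i).
    On 0/1 indicator vectors this reduces to the Jaccard distance."""
    denom = sum(max(xi, yi) for xi, yi in zip(x, y))
    if denom == 0:
        return 0.0
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / denom
```

For example, the sets {1, 2, 3} and {2, 3, 4} have a symmetric difference of size 2 and a union of size 4, giving a Jaccard distance of 0.5; the same value is obtained by applying `marczewski_steinhaus` to their indicator vectors.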

Keywords

Distance · Metric · Ontology · Machine learning · Text mining · High-dimensional data · Computational biology

Notes

Acknowledgements

We thank Prof. Jovana Kovačević from the University of Belgrade for helpful discussions. We also thank the Action Editor and three anonymous reviewers for their insightful comments that have contributed to improved precision and quality of the paper.

Funding

This work was partially supported by the National Science Foundation (NSF) Grant DBI-1458477 (PR), the NSF Grant DMS-1206405 (EAH), and the Precision Health Initiative of Indiana University.

Supplementary material

Supplementary material 1: 10618_2019_622_MOESM1_ESM.xlsx (XLSX 63 KB)
Supplementary material 2: 10618_2019_622_MOESM2_ESM.pdf (PDF 1167 KB)

Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Indiana University, Bloomington, USA
  2. Northeastern University, Boston, USA