A Tale of Four Metrics

Conference paper in Similarity Search and Applications (SISAP 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9939)

Abstract

There are many contexts where the definition of similarity in multivariate space needs to be based on the correlation, rather than the absolute values, of the variables. Examples include classic IR measures such as TF/IDF and BM25, client similarity measures based on collaborative filtering, feature analysis of chemical molecules, and biodiversity contexts.

In such cases, Cosine similarity is the almost standard choice. More recently, Jensen-Shannon divergence has appeared in a proper metric form, and a related metric, Structural Entropic Distance (SED), has been investigated. A fourth metric, based on a little-known divergence function named Triangular Divergence, is also assessed here.

We study the properties of these metrics in the context of similarity and metric search, comparing and contrasting their semantics and performance. We conclude that, although Cosine Distance is an almost automatic choice in this context, Triangular Distance is likely to offer the best compromise between semantics and performance.
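
For readers who want to experiment, here is a minimal sketch of the four distances over vectors normalised to probability distributions, assuming the standard published definitions; the function names, the base-2 logarithm, and the [0, 1] scaling of Triangular Distance are illustrative choices, not necessarily those made in the paper.

```python
import numpy as np

def _xlogx(v):
    # 0 * log(0) is treated as 0 rather than undefined (see note 1 below).
    return np.where(v > 0, v * np.log2(np.maximum(v, 1e-300)), 0.0)

def _entropy(v):
    # Shannon entropy in bits.
    return -np.sum(_xlogx(v))

def cosine_distance(x, y):
    # 1 - cos(angle between x and y); x, y non-negative and non-zero.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jensen_shannon_distance(x, y):
    # Metric form of Jensen-Shannon divergence [6, 7]: the square root of
    # JSD(x, y) = H((x + y) / 2) - (H(x) + H(y)) / 2.
    m = (x + y) / 2.0
    jsd = _entropy(m) - (_entropy(x) + _entropy(y)) / 2.0
    return np.sqrt(max(jsd, 0.0))  # clamp tiny negative rounding errors

def sed(x, y):
    # Structural Entropic Distance [4]: complexity of the mean distribution
    # over the geometric mean of the individual complexities, minus 1.
    # Complexity is Shannon entropy raised to the power of the logarithm
    # base (note 3 below); base 2 here.
    def complexity(v):
        return 2.0 ** _entropy(v)
    m = (x + y) / 2.0
    return complexity(m) / np.sqrt(complexity(x) * complexity(y)) - 1.0

def triangular_distance(x, y):
    # Square root of Triangular Divergence [10, 12],
    #   TD(x, y) = sum_i (x_i - y_i)^2 / (x_i + y_i),
    # with 0/0 terms treated as 0 and the sum halved to bound the result
    # in [0, 1].
    num, den = (x - y) ** 2, x + y
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.sqrt(np.sum(terms) / 2.0)
```

For example, with x = np.array([0.5, 0.5, 0.0]) and y = np.array([0.0, 0.5, 0.5]), the four functions return 0.5, roughly 0.707, roughly 0.414 and roughly 0.707 respectively, and all four return 0 for identical vectors.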

Notes

  1. Some functions are formally undefined in the presence of zero values, requiring either \(0\log 0\) or \(0/0\). In each case there is in fact a good mathematical argument for treating these terms as 0 rather than undefined; see the worked limit after these notes.

  2. See [4] for an explanation of this constant.

  3. In fact, Shannon's entropy raised to the power of the logarithm base; see [4] for details.

  4. 118 dimensions.

  5. Available at https://bitbucket.org/richardconnor/metric-space-framework.
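
The convention in note 1 is backed by the standard limit argument; by L'Hôpital's rule,

\[ \lim_{x \to 0^{+}} x\log x \;=\; \lim_{x \to 0^{+}} \frac{\log x}{1/x} \;=\; \lim_{x \to 0^{+}} \frac{1/x}{-1/x^{2}} \;=\; \lim_{x \to 0^{+}} (-x) \;=\; 0, \]

and for the \(0/0\) case, each term of Triangular Divergence satisfies \((x-y)^{2}/(x+y) \le x+y\) whenever \(x+y>0\), so these terms likewise vanish as both arguments tend to 0.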

References

  1. Connor, R., Cardillo, F.A., Vadicamo, L., Rabitti, F.: Hilbert exclusion: improved metric search through finite isometric embeddings. ArXiv e-prints, April 2016; accepted for publication in ACM TOIS

  2. Connor, R., Cardillo, F.A., Vadicamo, L., Rabitti, F.: Supermetric search with the four-point property. Accepted for publication at SISAP 2016, Tokyo, Japan, October 2016

  3. Connor, R., Moss, R.: A multivariate correlation distance for vector spaces. In: Navarro, G., Pestov, V. (eds.) SISAP 2012. LNCS, vol. 7404, pp. 209–225. Springer, Heidelberg (2012)

  4. Connor, R., Simeoni, F., Iakovos, M., Moss, R.: A bounded distance metric for comparing tree structure. Inf. Syst. 36(4), 748–764 (2011)

  5. Connor, R., Moss, R., Harvey, M.: A new probabilistic ranking model. In: Proceedings of the 2013 Conference on the Theory of Information Retrieval, ICTIR 2013, pp. 23:109–23:112, NY, USA (2013). http://doi.acm.org/10.1145/2499178.2499185

  6. Endres, D., Schindelin, J.: A new metric for probability distributions. IEEE Trans. Inf. Theor. 49(7), 1858–1860 (2003)

  7. Fuglede, B., Topsøe, F.: Jensen-Shannon divergence and Hilbert space embedding. In: Proceedings of the International Symposium on Information Theory, ISIT 2004, p. 31 (2004)

  8. Jones, K.S., Walker, S., Robertson, S.E.: A probabilistic model of information retrieval: development and comparative experiments: part 2. Inf. Process. Manag. 36(6), 809–840 (2000)

  9. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theor. 37(1), 145–151 (1991)

  10. Österreicher, F., Vajda, I.: A new class of metric divergences on probability spaces and its statistical applications. Ann. Inst. Stat. Math. 55, 639–653 (2003)

  11. Singhal, A.: Modern information retrieval: a brief overview. IEEE Data Eng. Bull. 24(4), 35–43 (2001)

  12. Topsøe, F.: Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theor. 46(4), 1602–1609 (2000)

  13. Topsøe, F.: Jensen-Shannon divergence and norm-based measures of discrimination and variation. Preprint, math.ku.dk (2003)

Author information

Correspondence to Richard Connor.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Connor, R. (2016). A Tale of Four Metrics. In: Amsaleg, L., Houle, M., Schubert, E. (eds) Similarity Search and Applications. SISAP 2016. Lecture Notes in Computer Science, vol 9939. Springer, Cham. https://doi.org/10.1007/978-3-319-46759-7_16

  • DOI: https://doi.org/10.1007/978-3-319-46759-7_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46758-0

  • Online ISBN: 978-3-319-46759-7
