Distance phenomena in high-dimensional chemical descriptor spaces: consequences for similarity-based approaches

  • M Rupp
  • G Schneider
Open Access
Poster presentation



Measuring the (dis)similarity of molecules is, besides descriptor selection, an important factor in many cheminformatics applications such as compound ranking, clustering, and property prediction. In this work, we focus on real-valued vector spaces (as opposed to the binary spaces of, e.g., fingerprints). We demonstrate the severe influence that the choice of (dis)similarity measure can have on the results of cheminformatics applications, and provide recommendations for such choices.

We briefly review the mathematical concepts [1] used to measure (dis)similarity in vector spaces, namely norms, metrics, inner products and similarity coefficients, and the relationships between them, employing commonly used [2][3] (dis)similarity measures in cheminformatics as examples.
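The relationships between these concepts can be illustrated with a minimal sketch. It assumes the standard dot product as the inner product and uses the Euclidean distance, the cosine coefficient, and the real-valued Tanimoto coefficient as example measures; the function names and the two descriptor vectors are hypothetical and not taken from the poster.

```python
import math

def inner(x, y):
    """Standard dot product, used here as the inner product."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Norm induced by the inner product: ||x|| = sqrt(<x, x>)."""
    return math.sqrt(inner(x, x))

def euclidean(x, y):
    """Metric induced by the norm: d(x, y) = ||x - y||."""
    return norm([a - b for a, b in zip(x, y)])

def cosine(x, y):
    """Cosine coefficient: the inner product of the normalized vectors."""
    return inner(x, y) / (norm(x) * norm(y))

def tanimoto(x, y):
    """Real-valued Tanimoto coefficient: <x,y> / (<x,x> + <y,y> - <x,y>)."""
    xy = inner(x, y)
    return xy / (inner(x, x) + inner(y, y) - xy)

# Two hypothetical real-valued descriptor vectors.
x = [1.0, 2.0, 2.0]
y = [2.0, 0.0, 1.0]

print("Euclidean distance:", euclidean(x, y))
print("cosine coefficient:", cosine(x, y))
print("Tanimoto coefficient:", tanimoto(x, y))
```

In this sketch every measure is built from the inner product alone, which mirrors the chain inner product, norm, metric; for example, the squared Euclidean distance equals <x,x> + <y,y> - 2<x,y>.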

Then, we present several phenomena (empty space phenomenon, sphere volume related phenomena, distance concentration [4][5][6]) in high-dimensional descriptor spaces which are not encountered in two and three dimensions. These phenomena are theoretically characterized and illustrated with both artificial and real (bioactivity) data examples.


  1. Meyer C: Matrix Analysis and Applied Linear Algebra. 2001, SIAM, Philadelphia.
  2. Leach A, Gillet V: An Introduction to Chemoinformatics. 2003, Springer Netherlands.
  3. Willett P: J Chem Inf Comput Sci. 1998, 38: 983-996.
  4. Aggarwal C, Hinneburg A, Keim D: ICDT 2001 Proceedings, LNCS 1973. 2001, 420-434.
  5. Beyer K, Goldstein J, Ramakrishnan R, Shaft U: ICDT 1999 Proceedings, LNCS 1540. 1999, 217-235.
  6. Francois D, Wertz V, Verleysen M: IEEE Trans Knowl Data Eng. 2007, 19: 873-886. 10.1109/TKDE.2007.1037.

Copyright information

© Rupp and Schneider; licensee BioMed Central Ltd. 2009

This article is published under license to BioMed Central Ltd.

Authors and Affiliations

  • M Rupp (1)
  • G Schneider (1)
  1. University of Frankfurt, Frankfurt am Main, Germany
