Molecular Similarity Measures

  • Gerald M. Maggiora
  • Veerabahu Shanmugasundaram
Part of the Methods in Molecular Biology™ book series (MIMB, volume 275)


Molecular similarity is a pervasive concept in chemistry. It is essential to many aspects of chemical reasoning and analysis and is perhaps the fundamental assumption underlying medicinal chemistry. Dissimilarity, the complement of similarity, also plays a major role in a growing number of applications of molecular diversity in combinatorial chemistry, high-throughput screening, and related fields. How molecular information is represented, called the representation problem, is important to the type of molecular similarity analysis (MSA) that can be carried out in any given situation. In this work, four types of mathematical structure are used to represent molecular information: sets, graphs, vectors, and functions. Molecular similarity is a pairwise relationship that induces structure on sets of molecules, giving rise to the concept of a chemistry space. Although all three concepts (molecular similarity, molecular representation, and chemistry space) are treated in this chapter, the emphasis is on molecular similarity measures. Similarity measures, also called similarity coefficients or indices, are functions that map pairs of compatible molecular representations, that is, representations of the same mathematical form, into real numbers usually, but not always, lying on the unit interval. This chapter presents a somewhat pedagogical discussion of many types of molecular similarity measures, their strengths and limitations, and their relationship to one another.
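
As a rough illustration of the definition above, the sketch below implements two widely used coefficients: the Tanimoto (Jaccard) coefficient for set representations and the cosine coefficient for vector representations. The abstract does not name specific measures, so these choices, the function names, and the toy feature labels are assumptions made for illustration only and are not necessarily the chapter's formulation.

```python
from math import sqrt


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient of two feature sets; lies in [0, 1]."""
    if not a and not b:
        # Convention assumed here: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


def cosine(u: list, v: list) -> float:
    """Cosine coefficient of two equal-length vectors; lies in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("representations must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    if norm == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / norm


# Hypothetical feature sets for two molecules (labels are illustrative only).
mol_a = {"aromatic_ring", "hydroxyl", "carboxyl"}
mol_b = {"aromatic_ring", "hydroxyl", "amine"}

similarity = tanimoto(mol_a, mol_b)   # 2 shared features / 4 distinct features
dissimilarity = 1.0 - similarity      # complement of the similarity

# With signed components the cosine coefficient can leave the unit interval.
opposed = cosine([1.0, 0.0], [-1.0, 0.0])
```

Both functions accept only pairs of the same mathematical form (two sets, or two vectors of equal length), which mirrors the compatibility requirement in the definition. The cosine case also shows why such values lie on the unit interval only "usually": vectors with signed components can give negative coefficients.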

Key Words

Molecular similarity; molecular similarity analyses (MSA); dissimilarity



Copyright information

© Humana Press Inc. 2004

Authors and Affiliations

  • Gerald M. Maggiora (1)
  • Veerabahu Shanmugasundaram (2)
  1. Division of Medicinal Chemistry, College of Pharmacy, University of Arizona, Tucson, USA
  2. Computer Assisted Drug Design, Pfizer Global Research and Development, Ann Arbor, USA
