Similarity-searching and clustering algorithms for processing databases of two-dimensional and three-dimensional chemical structures

Molecular Similarity in Drug Design

Abstract

Databases of chemical structures play an increasingly important role in the fine-chemicals industry, e.g. in the development of novel pharmaceuticals and agrochemicals (Ash et al., 1991). These databases contain tens or hundreds of thousands of chemical substances, in either two-dimensional (2D) or three-dimensional (3D) form, and several different searching mechanisms have been developed to provide access to the data stored in them. The most common mechanisms are structure searching, which retrieves a single specific molecule, and substructure searching, which retrieves all molecules that contain a user-defined partial structure, e.g. a putative pharmacophore pattern. An extended programme of research at the University of Sheffield has sought to develop a complementary means of access, called similarity searching, and this chapter provides an overview of some of the algorithms that have been developed for this purpose since the programme commenced in the early 1980s. Specifically, we are interested in techniques that allow a user of a chemical database to input a target structure of interest and then to retrieve those molecules in the database that are structurally most similar to it. Our programme of research has also considered how cluster-analysis methods can be used for the processing of chemical databases.
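
As an illustration only, the retrieval task described in the abstract (rank database molecules by structural similarity to a target) might be sketched as follows. The abstract does not specify a representation or a similarity coefficient, so everything here is an assumption: molecules are represented as sets of hypothetical fragment labels, similarity is measured with the Tanimoto coefficient, and the names `tanimoto`, `similarity_search`, and the toy `database` are invented for the example.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto coefficient: shared fragments / fragments in either set."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty descriptions
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def similarity_search(
    target: FrozenSet[str],
    database: Dict[str, FrozenSet[str]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k database entries ranked by decreasing similarity."""
    scored = [(name, tanimoto(target, frags)) for name, frags in database.items()]
    # Sort by descending score, breaking ties by name for a stable ranking.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Toy database: fragment labels are hypothetical, not a real screen set.
database = {
    "mol_A": frozenset({"C-C", "C-O", "C=O"}),
    "mol_B": frozenset({"C-C", "C-N"}),
    "mol_C": frozenset({"C-C", "C-O", "C=O", "C-N"}),
}
target = frozenset({"C-C", "C-O", "C=O"})

hits = similarity_search(target, database, top_k=2)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

The output is a ranked list rather than a yes/no match, which is what distinguishes this kind of search from the structure and substructure searching mentioned above.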

References

  • Adamson, G.W. and Bush, J.A. (1973) A method for the automatic classification of chemical structures. Information Storage and Retrieval, 9, 561–568.

  • Adamson, G.W. and Bush, J.A. (1975) A comparison of the performance of some similarity and dissimilarity measures in the automatic classification of chemical structures. Journal of Chemical Information and Computer Sciences, 15, 55–58.

  • Artymiuk, P.J., Grindley, H.M., Park, J.E., Rice, D.W. and Willett, P. (1992) Three-dimensional structural resemblance between leucine aminopeptidase and carboxypeptidase A revealed by graph-theoretical techniques. FEBS Letters, 303, 48–52.

  • Artymiuk, P.J., Grindley, H.M., Kumar, K., Rice, D.W. and Willett, P. (1993) Three-dimensional structural resemblance between the ribonuclease H and connection domains of HIV reverse transcriptase revealed using graph theoretical techniques. FEBS Letters, 324, 15–21.

  • Artymiuk, P.J., Grindley, H.M., Poirrette, A.R., Rice, D.W., Ujah, E.C. and Willett, P. (1994a) Identification of β-sheet motifs, of ψ-loops and of patterns of amino-acid residues in three-dimensional protein structures using a subgraph-isomorphism algorithm. Journal of Chemical Information and Computer Sciences, 34, 54–62.

  • Artymiuk, P.J., Grindley, H.M., MacKenzie, A.B., Rice, D.W., Ujah, E.C. and Willett, P. (1994b) PROTEP: a program for graph-theoretic similarity searching of the 3D structures in the Protein Data Bank. In Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches (ed. R. Carbo). In press.

  • Ash, J.E., Chubb, P.A., Ward, S.E., Welford, S.M. and Willett, P. (1985) Communication, Storage and Retrieval of Chemical Information. Ellis Horwood, Chichester.

  • Ash, J.E., Warr, W.A. and Willett, P. (eds) (1991) Chemical Structure Systems. Ellis Horwood, Chichester.

  • Barnard, J.M. and Downs, G.M. (1992) Clustering of chemical structures on the basis of two-dimensional similarity measures. Journal of Chemical Information and Computer Sciences, 32, 644–649.

  • Bath, P.A., Morris, C.A. and Willett, P. (1993) Effect of standardisation on fragment-based measures of structural similarity. Journal of Chemometrics, 7, 543–550.

  • Bath, P.A., Poirrette, A.R., Willett, P. and Allen, F.H. (1994) Similarity searching in files of three-dimensional chemical structures: comparison of fragment-based measures of shape similarity. Journal of Chemical Information and Computer Sciences, 34, 141–147.

  • Bemis, G.W. and Kuntz, I.D. (1992) A fast and efficient method for 2D and 3D molecular shape description. Journal of Computer-Aided Molecular Design, 6, 607–628.

  • Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer Jnr, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, M. and Tasumi, M. (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology, 112, 535–542.

  • Blake, C.C.F. and Oatley, S.J. (1977) Protein—DNA and protein—protein hormone interactions in prealbumin: a model of the thyroid hormone nuclear receptor. Nature, 268, 115–120.

  • Blake, C.C.F., Geisow, M.J. and Oatley, S.J. (1978) Structure of prealbumin: secondary, tertiary and quaternary interactions characterised by Fourier refinement at 1.8 Å. Journal of Molecular Biology, 121, 339–356.

  • Brint, A.T. and Willett, P. (1987) Algorithms for the identification of three-dimensional maximal common substructures. Journal of Chemical Information and Computer Sciences, 27, 152–158.

  • Brint, A.T. and Willett, P. (1988) Upperbound procedures for the identification of similar three-dimensional chemical structures. Journal of Computer-Aided Molecular Design, 2, 311–320.

  • Bron, C. and Kerbosch, J. (1973) Algorithm 457. Finding all cliques of an undirected graph. Communications of the ACM, 16, 575–577.

  • Bures, M.G., Martin, Y.C. and Willett, P. (1994) Searching techniques for databases of three-dimensional chemical structures. Topics in Stereochemistry, 21, 467–511.

  • Burt, C., Richards, W.H. and Huxley, P. (1990) The application of molecular similarity calculations. Journal of Computational Chemistry, 11, 1139–1146.

  • Clark, D.E., Willett, P. and Kenny, P.W. (1992) Pharmacophoric pattern matching in files of three-dimensional chemical structures: use of bounded-distance matrices for the representation and searching of conformationally-flexible molecules. Journal of Molecular Graphics, 10, 194–204.

  • Cringean, J.K., Pepperrell, C.A., Poirrette, A.R. and Willett, P. (1990) Selection of screens for three-dimensional substructure searching. Tetrahedron Computer Methodology, 3, 37–46.

  • Dittmar, P.G., Farmer, N.A., Fisanick, W., Haines, R.C. and Mockus, J. (1983) The CAS ONLINE search system. 1. General system design and selection, generation and use of search screens. Journal of Chemical Information and Computer Sciences, 23, 93–102.

  • Downs, G.M. and Willett, P. (1991) The use of similarity and clustering techniques for the prediction of molecular properties. In Applied Multivariate Analysis in SAR and Environmental Studies (eds J. Devillers and W. Karcher), pp. 247–279. European Communities, Brussels.

  • Downs, G.M. and Willett, P. (1994) Clustering of chemical-structure databases for compound selection. In Chemometric Methods in Molecular Design (ed. H. van de Waterbeemd). VCH, New York. In press.

  • Downs, G.M., Willett, P. and Fisanick, W. (1994) Similarity searching and clustering of chemical-structure databases using molecular property data. Journal of Chemical Information and Computer Sciences (in press).

  • Edelbrock, C. (1979) Comparing the accuracy of hierarchical clustering algorithms: the problem of classifying everybody. Multivariate Behavioural Research, 14, 367–384.

  • El-Hamdouchi, A. and Willett, P. (1989) Comparison of hierarchic agglomerative clustering methods for document retrieval. Computer Journal, 32, 220–227.

  • Everitt, B.S. (1993) Cluster Analysis, 3rd edn. Edward Arnold, London.

  • Fisanick, W., Cross, K.P. and Rusinko, A. (1992) Similarity searching on CAS Registry substances. 1. Global molecular property and generic atom triangle geometric searching. Journal of Chemical Information and Computer Sciences, 32, 664–674.

  • Grindley, H.M., Artymiuk, P.J., Rice, D.W. and Willett, P. (1993) Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. Journal of Molecular Biology, 229, 707–721.

  • Guenoche, A., Hansen, P. and Jaumard, B. (1991) Efficient algorithms for divisive hierarchical clustering with the diameter criterion. Journal of Classification, 8, 5–30.

  • Gund, P. (1977) Three-dimensional pharmacophoric pattern searching. Progress in Molecular and Subcellular Biology, 5, 117–143.

  • Hagadone, T. (1992) Molecular substructure similarity searching: efficient retrieval in two-dimensional structure databases. Journal of Chemical Information and Computer Sciences, 32, 515–521.

  • Harel, D. (1987) Algorithmics: the Spirit of Computing. Addison-Wesley, Reading, Massachusetts.

  • Hodes, L. (1989) Clustering a large number of compounds. I. Establishing the method on an initial sample. Journal of Chemical Information and Computer Sciences, 29, 66–71.

  • Jarvis, R.A. and Patrick, E.A. (1973) Clustering using a similarity measure based on shared nearest neighbours. IEEE Transactions on Computers, C-22, 1025–1034.

  • Johnson, M.A. and Maggiora, G.M. (eds) (1990) Concepts and Applications of Molecular Similarity. John Wiley, New York.

  • Kim, S.H., de Vos, A. and Ogata, C. (1988) Crystal structures of two intensely sweet proteins. Trends in Biochemical Science, 13, 13–15.

  • Lajiness, M.S. (1991) An evaluation of the performance of dissimilarity selection. In QSAR: Rational Approaches to the Design of Bioactive Compounds (eds C. Silipo and A. Vittoria), pp. 201–204. Elsevier Science Publishers, Amsterdam.

  • Lajiness, M.S., Johnson, M.A. and Maggiora, G. (1989) Implementing drug screening programs using molecular similarity methods. In QSAR: Quantitative Structure—Activity Relationships in Drug Design (ed. J.L. Fauchere), pp. 173–176. Alan R. Liss Inc., New York.

  • Lance, G.N. and Williams, W.T. (1967) A general theory of classificatory sorting strategies. I. Hierarchical systems. Computer Journal, 9, 373–380.

  • Manaut, F., Sanz, F., Jose, J. and Milesi, M. (1991) Automatic search for maximum similarity between molecular electrostatic potential distributions. Journal of Computer-Aided Molecular Design, 5, 371–380.

  • Milligan, G.W. (1980) An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.

  • Milligan, G.W. and Cooper, M.C. (1988) A study of standardisation of variables in cluster analysis. Journal of Classification, 5, 181–204.

  • Mitchell, E.M., Artymiuk, P.J., Rice, D.W. and Willett, P. (1990) Use of techniques derived from graph theory to compare secondary structure motifs in proteins. Journal of Molecular Biology, 212, 151–166.

  • Moock, T.E., Grier, D.L., Hounshell, W.D., Grethe, G., Cronin, K., Nourse, J.G. and Theodosiou, J. (1988) Similarity searching in the organic reaction domain. Tetrahedron Computer Methodology, 1, 117–128.

  • Murtagh, F. (1983) A survey of recent advances in hierarchical clustering algorithms. Computer Journal, 26, 354–359.

  • Murtagh, F. (1993) Search algorithms for numeric and quantitative data. In Intelligent Information Retrieval: The Case of Astronomy and Related Space Sciences (eds A. Heck and F. Murtagh), pp. 29–48. Kluwer Academic Publishers, Dordrecht.

  • Nilakantan, R., Bauman, N. and Venkataraghavan, R. (1993) A new method for rapid characterisation of molecular shape: applications in drug design. Journal of Chemical Information and Computer Sciences, 33, 79–85.

  • Ohlendorf, D.H., Lipscomb, J.D. and Weber, P.C. (1988) Structure and assembly of protocatechuate 3,4-dioxygenase. Nature, 336, 403–405.

  • Pepperrell, C.A. and Willett, P. (1991) Techniques for the calculation of three-dimensional structural similarity using inter-atomic distances. Journal of Computer-Aided Molecular Design, 5, 455–474.

  • Pepperrell, C.A., Willett, P. and Taylor, R. (1990) Implementation and use of an atom-mapping procedure for similarity searching in databases of 3D chemical structures. Tetrahedron Computer Methodology, 3, 575–593.

  • Pepperrell, C.A., Poirrette, A.R., Willett, P. and Taylor, R. (1991) Development of an atom-mapping procedure for similarity searching in databases of three-dimensional chemical structures. Pesticide Science, 33, 97–111.

  • Perry, N.C. and van Geerestein, V. (1992) Database searching on the basis of three-dimensional molecular similarity using the SPERM program. Journal of Chemical Information and Computer Sciences, 32, 607–616.

  • Petke, J.D. (1993) Cumulative and discrete similarity analysis of electrostatic potentials and fields. Journal of Computational Chemistry, 14, 928–933.

  • Rubin, V. and Willett, P. (1983) A comparison of some hierarchical monothetic divisive clustering algorithms for structure property correlation. Analytica Chimica Acta, 151, 161–166.

  • Siegel, S. and Castellan, N.J. (1988) Non-Parametric Statistics for the Social Sciences. McGraw-Hill, London.

  • Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy. W.H. Freeman, San Francisco.

  • Stirk, H.J., Woolfson, D.N., Hutchinson, E.G. and Thornton, J.M. (1992) Depicting topology and handedness in jellyroll structures. FEBS Letters, 308, 1–3.

  • Stouch, T.R. and Jurs, P.C. (1986) Computer-aided studies of the structure—activity relationships between the structure of some steroids and their anti-inflammatory activity. Journal of Medicinal Chemistry, 29, 2125–2136.

  • Todeschini, R. (1989) k-nearest neighbour method: the influence of data transformations and metrics. Chemometrics and Intelligent Laboratory Systems, 6, 213–220.

  • Ujah, E.C. (1992) A Study of β-Sheet Motifs at Different Levels of Structural Abstraction Using Graph-Theoretic and Dynamic Programming Techniques. PhD thesis, University of Sheffield.

  • van der Wel, H. and Loeve, K. (1972) Isolation and characterization of thaumatin I and II, the sweet-tasting proteins from Thaumatococcus danielli Benth. European Journal of Biochemistry, 31, 221–225.

  • Voorhees, E.M. (1986) Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing and Management, 22, 465–476.

  • Wild, D.J. and Willett, P. (1994) Similarity searching in files of three-dimensional chemical structures: implementation of atom mapping on the Distributed Array Processor DAP-610, the MasPar MP-1104 and the Connection Machine CM-200. Journal of Chemical Information and Computer Sciences, 34, 224–231.

  • Willett, P. (1982a) A comparison of some hierarchical agglomerative clustering algorithms for structure property correlation. Analytica Chimica Acta, 136, 29–37.

  • Willett, P. (1982b) The calculation of inter-molecular similarity coefficients using an inverted file algorithm. Analytica Chimica Acta, 138, 339–342.

  • Willett, P. (1983) Some heuristics for nearest neighbour searching in chemical structure files. Journal of Chemical Information and Computer Sciences, 23, 22–25.

  • Willett, P. (1984) An evaluation of relocation clustering algorithms for the automatic classification of chemical structures. Journal of Chemical Information and Computer Sciences, 24, 29–33.

  • Willett, P. (1987a) A review of chemical structure retrieval systems. Journal of Chemometrics, 1, 139–155.

  • Willett, P. (1987b) Similarity and Clustering in Chemical Information Systems. Research Studies Press, Letchworth.

  • Willett, P. (1992) A review of three-dimensional chemical structure retrieval systems. Journal of Chemometrics, 6, 289–305.

  • Willett, P. and Winterman, V. (1986) A comparison of some measures for the determination of inter-molecular structural similarity. Quantitative Structure—Activity Relationships, 5, 18–25.

  • Willett, P., Winterman, V. and Bawden, D. (1986a) Implementation of nearest-neighbour searching in an online chemical structure search system. Journal of Chemical Information and Computer Sciences, 26, 36–41.

  • Willett, P., Winterman, V. and Bawden, D. (1986b) Implementation of non-hierarchic cluster analysis methods in chemical information systems: selection of compounds for biological testing and clustering of substructure search output. Journal of Chemical Information and Computer Sciences, 26, 109–118.

Copyright information

© 1995 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Willett, P. (1995). Similarity-searching and clustering algorithms for processing databases of two-dimensional and three-dimensional chemical structures. In: Dean, P.M. (eds) Molecular Similarity in Drug Design. Springer, Dordrecht. https://doi.org/10.1007/978-94-011-1350-2_5

  • DOI: https://doi.org/10.1007/978-94-011-1350-2_5

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-010-4589-6

  • Online ISBN: 978-94-011-1350-2

  • eBook Packages: Springer Book Archive
