Abstract
Databases of chemical structures play an increasingly important role in the fine-chemicals industry, e.g. for the development of novel pharmaceuticals and agrochemicals (Ash et al., 1991). These databases contain tens or hundreds of thousands of chemical substances, in either two-dimensional (2D) or three-dimensional (3D) form, and several different searching mechanisms have been developed to provide access to the data stored in them. The most common mechanisms are structure searching, which involves the retrieval of a single specific molecule, and substructure searching, which involves the retrieval of all molecules that contain a user-defined partial structure, e.g. a putative pharmacophore pattern. An extended programme of research at the University of Sheffield has sought to develop a complementary means of access, called similarity searching, and this chapter provides an overview of some of the algorithms that have been developed for this purpose since the programme commenced in the early 1980s. Specifically, we are interested in techniques that will allow a user of a chemical database to input a target structure of interest, and then to retrieve those molecules in the database that are structurally most similar to the target molecule. Our programme of research has also considered how cluster-analysis methods can be used for the processing of chemical databases.
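The similarity-searching idea described above (input a target structure, retrieve the most similar database molecules) can be illustrated with a minimal sketch. The abstract does not specify a similarity measure or a structure representation, so everything concrete here is an assumption made for illustration: molecules are represented as sets of fragment labels, similarity is measured with the Tanimoto coefficient, and the names `tanimoto`, `similarity_search`, and the toy `database` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto coefficient: fragments in common / fragments in either set.

    Assumption: the chapter abstract names no specific coefficient; this
    one is used here purely as an example of a fragment-based measure.
    """
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(
    target: FrozenSet[str],
    database: Dict[str, FrozenSet[str]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rank every database molecule by similarity to the target."""
    scored = [(name, tanimoto(target, frags)) for name, frags in database.items()]
    # Highest similarity first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Hypothetical toy database: each molecule is a set of fragment labels.
database = {
    "mol_A": frozenset({"C-C", "C=O", "C-N", "ring6"}),
    "mol_B": frozenset({"C-C", "C-O", "ring6"}),
    "mol_C": frozenset({"C-N", "N-H"}),
}
target = frozenset({"C-C", "C=O", "ring6"})

hits = similarity_search(target, database, top_n=2)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

In contrast to substructure searching, which returns every molecule containing the query pattern, a ranking of this kind returns the nearest neighbours of the target in decreasing order of similarity, so the user controls how far down the list to look.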
References
Adamson, G.W. and Bush, J.A. (1973) A method for the automatic classification of chemical structures. Information Storage and Retrieval, 9, 561–568.
Adamson, G.W. and Bush, J.A. (1975) A comparison of the performance of some similarity and dissimilarity measures in the automatic classification of chemical structures. Journal of Chemical Information and Computer Sciences, 15, 55–58.
Artymiuk, P.J., Grindley, H.M., Park, J.E., Rice, D.W. and Willett, P. (1992) Three-dimensional structural resemblance between leucine aminopeptidase and carboxypeptidase A revealed by graph-theoretical techniques. FEBS Letters, 303, 48–52.
Artymiuk, P.J., Grindley, H.M., Kumar, K., Rice, D.W. and Willett, P. (1993) Three-dimensional structural resemblance between the ribonuclease H and connection domains of HIV reverse transcriptase revealed using graph theoretical techniques. FEBS Letters, 324, 15–21.
Artymiuk, P.J., Grindley, H.M., Poirrette, A.R., Rice, D.W., Ujah, E.C. and Willett, P. (1994a) Identification of β-sheet motifs, of ψ-loops and of patterns of amino-acid residues in three-dimensional protein structures using a subgraph-isomorphism algorithm. Journal of Chemical Information and Computer Sciences, 34, 54–62.
Artymiuk, P.J., Grindley, H.M., MacKenzie, A.B., Rice, D.W., Ujah, E.C. and Willett, P. (1994b) PROTEP: a program for graph-theoretic similarity searching of the 3D structures in the Protein Data Bank. In Molecular Similarity and Reactivity: From Quantum Chemical to Phenomenological Approaches (ed. R. Carbo). In press.
Ash, J.E., Chubb, P.A., Ward, S.E., Welford, S.M. and Willett, P. (1985) Communication, Storage and Retrieval of Chemical Information. Ellis Horwood, Chichester.
Ash, J.E., Warr, W.A. and Willett, P. (eds) (1991) Chemical Structure Systems. Ellis Horwood, Chichester.
Barnard, J.M. and Downs, G.M. (1992) Clustering of chemical structures on the basis of two-dimensional similarity measures. Journal of Chemical Information and Computer Sciences, 32, 644–649.
Bath, P.A., Morris, C.A. and Willett, P. (1993) Effect of standardisation on fragment-based measures of structural similarity. Journal of Chemometrics, 7, 543–550.
Bath, P.A., Poirrette, A.R., Willett, P. and Allen, F.H. (1994) Similarity searching in files of three-dimensional chemical structures: comparison of fragment-based measures of shape similarity. Journal of Chemical Information and Computer Sciences, 34, 141–147.
Bemis, G.W. and Kuntz, I.D. (1992) A fast and efficient method for 2D and 3D molecular shape description. Journal of Computer-Aided Molecular Design, 6, 607–628.
Bernstein, F.C., Koetzle, T.F., Williams, G.J.B., Meyer Jnr, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, M. and Tasumi, M. (1977) The Protein Data Bank: a computer-based archival file for macromolecular structures. Journal of Molecular Biology, 112, 535–542.
Blake, C.C.F. and Oatley, S.J. (1977) Protein—DNA and protein—protein hormone interactions in prealbumin: a model of the thyroid hormone nuclear receptor. Nature, 268, 115–120.
Blake, C.C.F., Geisow, M.J. and Oatley, S.J. (1978) Structure of prealbumin: secondary, tertiary and quaternary interactions characterised by Fourier refinement at 1.8 Å. Journal of Molecular Biology, 121, 339–356.
Brint, A.T. and Willett, P. (1987) Algorithms for the identification of three-dimensional maximal common substructures. Journal of Chemical Information and Computer Sciences, 27, 152–158.
Brint, A.T. and Willett, P. (1988) Upperbound procedures for the identification of similar three-dimensional chemical structures. Journal of Computer-Aided Molecular Design, 2, 311–320.
Bron, C. and Kerbosch, J. (1973) Algorithm 457. Finding all cliques of an undirected graph. Communications of the ACM, 16, 575–577.
Bures, M.G., Martin, Y.C. and Willett, P. (1994) Searching techniques for databases of three-dimensional chemical structures. Topics in Stereochemistry, 21, 467–511.
Burt, C., Richards, W.H. and Huxley, P. (1990) The application of molecular similarity calculations. Journal of Computational Chemistry, 11, 1139–1146.
Clark, D.E., Willett, P. and Kenny, P.W. (1992) Pharmacophoric pattern matching in files of three-dimensional chemical structures: use of bounded-distance matrices for the representation and searching of conformationally-flexible molecules. Journal of Molecular Graphics, 10, 194–204.
Cringean, J.K., Pepperrell, C.A., Poirrette, A.R. and Willett, P. (1990) Selection of screens for three-dimensional substructure searching. Tetrahedron Computer Methodology, 3, 37–46.
Dittmar, P.G., Farmer, N.A., Fisanick, W., Haines, R.C. and Mockus, J. (1983) The CAS ONLINE search system. 1. General system design and selection, generation and use of search screens. Journal of Chemical Information and Computer Sciences, 23, 93–102.
Downs, G.M. and Willett, P. (1991) The use of similarity and clustering techniques for the prediction of molecular properties. In Applied Multivariate Analysis in SAR and Environmental Studies (eds J. Devillers and W. Karcher), pp. 247–279. European Communities, Brussels.
Downs, G.M. and Willett, P. (1994) Clustering of chemical-structure databases for compound selection. In Chemometric Methods in Molecular Design (ed. H. van de Waterbeemd). VCH, New York. In press.
Downs, G.M., Willett, P. and Fisanick, W. (1994) Similarity searching and clustering of chemical-structure databases using molecular property data. Journal of Chemical Information and Computer Sciences (in press).
Edelbrock, C. (1979) Comparing the accuracy of hierarchical clustering algorithms: the problem of classifying everybody. Multivariate Behavioural Research, 14, 367–384.
El-Hamdouchi, A. and Willett, P. (1989) Comparison of hierarchic agglomerative clustering methods for document retrieval. Computer Journal, 32, 220–227.
Everitt, B.S. (1993) Cluster Analysis, 3rd edn. Edward Arnold, London.
Fisanick, W., Cross, K.P. and Rusinko, A. (1992) Similarity searching on CAS Registry substances. 1. Global molecular property and generic atom triangle geometric searching. Journal of Chemical Information and Computer Sciences, 32, 664–674.
Grindley, H.M., Artymiuk, P.J., Rice, D.W. and Willett, P. (1993) Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. Journal of Molecular Biology, 229, 707–721.
Guenoche, A., Hansen, P. and Jaumard, B. (1991) Efficient algorithms for divisive hierarchical clustering with the diameter criterion. Journal of Classification, 8, 5–30.
Gund, P. (1977) Three-dimensional pharmacophoric pattern searching. Progress in Molecular and Subcellular Biology, 5, 117–143.
Hagadone, T. (1992) Molecular substructure similarity searching: efficient retrieval in two-dimensional structure databases. Journal of Chemical Information and Computer Sciences, 32, 515–521.
Harel, D. (1987) Algorithmics: the Spirit of Computing. Addison-Wesley, Reading, Massachusetts.
Hodes, L. (1989) Clustering a large number of compounds. I. Establishing the method on an initial sample. Journal of Chemical Information and Computer Sciences, 29, 66–71.
Jarvis, R.A. and Patrick, E.A. (1973) Clustering using a similarity measure based on shared nearest neighbours. IEEE Transactions on Computers, C-22, 1025–1034.
Johnson, M.A. and Maggiora, G.M. (eds) (1990) Concepts and Applications of Molecular Similarity. John Wiley, New York.
Kim, S.H., de Vos, A. and Ogata, C. (1988) Crystal structures of two intensely sweet proteins. Trends in Biochemical Sciences, 13, 13–15.
Lajiness, M.S. (1991) An evaluation of the performance of dissimilarity selection. In QSAR: Rational Approaches to the Design of Bioactive Compounds (eds C. Silipo and A. Vittoria), pp. 201–204. Elsevier Science Publishers, Amsterdam.
Lajiness, M.S., Johnson, M.A. and Maggiora, G. (1989) Implementing drug screening programs using molecular similarity methods. In QSAR: Quantitative Structure—Activity Relationships in Drug Design (ed. J.L. Fauchere), pp. 173–176. Alan R. Liss Inc., New York.
Lance, G.N. and Williams, W.T. (1967) A general theory of classificatory sorting strategies. I. Hierarchical systems. Computer Journal, 9, 373–380.
Manaut, F., Sanz, F., Jose, J. and Milesi, M. (1991) Automatic search for maximum similarity between molecular electrostatic potential distributions. Journal of Computer-Aided Molecular Design, 5, 371–380.
Milligan, G.W. (1980) An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.
Milligan, G.W. and Cooper, M.C. (1988) A study of standardisation of variables in cluster analysis. Journal of Classification, 5, 181–204.
Mitchell, E.M., Artymiuk, P.J., Rice, D.W. and Willett, P. (1990) Use of techniques derived from graph theory to compare secondary structure motifs in proteins. Journal of Molecular Biology, 212, 151–166.
Moock, T.E., Grier, D.L., Hounshell, W.D., Grethe, G., Cronin, K., Nourse, J.G. and Theodosiou, J. (1988) Similarity searching in the organic reaction domain. Tetrahedron Computer Methodology, 1, 117–128.
Murtagh, F. (1983) A survey of recent advances in hierarchical clustering algorithms. Computer Journal, 26, 354–359.
Murtagh, F. (1993) Search algorithms for numeric and quantitative data. In Intelligent Information Retrieval: The Case of Astronomy and Related Space Sciences (eds A. Heck and F. Murtagh), pp. 29–48. Kluwer Academic Publishers, Dordrecht.
Nilakantan, R., Bauman, N. and Venkataraghavan, R. (1993) A new method for rapid characterisation of molecular shape: applications in drug design. Journal of Chemical Information and Computer Sciences, 33, 79–85.
Ohlendorf, D.H., Lipscomb, J.D. and Weber, P.C. (1988) Structure and assembly of protocatechuate 3,4-dioxygenase. Nature, 336, 403–405.
Pepperrell, C.A. and Willett, P. (1991) Techniques for the calculation of three-dimensional structural similarity using inter-atomic distances. Journal of Computer-Aided Molecular Design, 5, 455–474.
Pepperrell, C.A., Willett, P. and Taylor, R. (1990) Implementation and use of an atom-mapping procedure for similarity searching in databases of 3D chemical structures. Tetrahedron Computer Methodology, 3, 575–593.
Pepperrell, C.A., Poirrette, A.R., Willett, P. and Taylor, R. (1991) Development of an atom-mapping procedure for similarity searching in databases of three-dimensional chemical structures. Pesticide Science, 33, 97–111.
Perry, N.C. and van Geerestein, V. (1992) Database searching on the basis of three-dimensional molecular similarity using the SPERM program. Journal of Chemical Information and Computer Sciences, 32, 607–616.
Petke, J.D. (1993) Cumulative and discrete similarity analysis of electrostatic potentials and fields. Journal of Computational Chemistry, 14, 928–933.
Rubin, V. and Willett, P. (1983) A comparison of some hierarchical monothetic divisive clustering algorithms for structure property correlation. Analytica Chimica Acta, 151, 161–166.
Siegel, S. and Castellan, N.J. (1988) Non-Parametric Statistics for the Social Sciences. McGraw-Hill, London.
Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy. W.H. Freeman, San Francisco.
Stirk, H.J., Woolfson, D.N., Hutchinson, E.G. and Thornton, J.M. (1992) Depicting topology and handedness in jellyroll structures. FEBS Letters, 308, 1–3.
Stouch, T.R. and Jurs, P.C. (1986) Computer-aided studies of the structure—activity relationships between the structure of some steroids and their anti-inflammatory activity. Journal of Medicinal Chemistry, 29, 2125–2136.
Todeschini, R. (1989) k-nearest neighbour method: the influence of data transformations and metrics. Chemometrics and Intelligent Laboratory Systems, 6, 213–220.
Ujah, E.C. (1992) A Study of β-Sheet Motifs at Different Levels of Structural Abstraction Using Graph-Theoretic and Dynamic Programming Techniques. PhD thesis, University of Sheffield.
van der Wel, H. and Loeve, K. (1972) Isolation and characterization of thaumatin I and II, the sweet-tasting proteins from Thaumatococcus danielli Benth. European Journal of Biochemistry, 31, 221–225.
Voorhees, E.M. (1986) Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing and Management, 22, 465–476.
Wild, D.J. and Willett, P. (1994) Similarity searching in files of three-dimensional chemical structures: implementation of atom mapping on the Distributed Array Processor DAP-610, the MasPar MP-1104 and the Connection Machine CM-200. Journal of Chemical Information and Computer Sciences, 34, 224–231.
Willett, P. (1982a) A comparison of some hierarchical agglomerative clustering algorithms for structure property correlation. Analytica Chimica Acta, 136, 29–37.
Willett, P. (1982b) The calculation of inter-molecular similarity coefficients using an inverted file algorithm. Analytica Chimica Acta, 138, 339–342.
Willett, P. (1983) Some heuristics for nearest neighbour searching in chemical structure files. Journal of Chemical Information and Computer Sciences, 23, 22–25.
Willett, P. (1984) An evaluation of relocation clustering algorithms for the automatic classification of chemical structures. Journal of Chemical Information and Computer Sciences, 24, 29–33.
Willett, P. (1987a) A review of chemical structure retrieval systems. Journal of Chemometrics, 1, 139–155.
Willett, P. (1987b) Similarity and Clustering in Chemical Information Systems. Research Studies Press, Letchworth.
Willett, P. (1992) A review of three-dimensional chemical structure retrieval systems. Journal of Chemometrics, 6, 289–305.
Willett, P. and Winterman, V. (1986) A comparison of some measures for the determination of inter-molecular structural similarity. Quantitative Structure—Activity Relationships, 5, 18–25.
Willett, P., Winterman, V. and Bawden, D. (1986a) Implementation of nearest-neighbour searching in an online chemical structure search system. Journal of Chemical Information and Computer Sciences, 26, 36–41.
Willett, P., Winterman, V. and Bawden, D. (1986b) Implementation of non-hierarchic cluster analysis methods in chemical information systems: selection of compounds for biological testing and clustering of substructure search output. Journal of Chemical Information and Computer Sciences, 26, 109–118.
Copyright information
© 1995 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Willett, P. (1995). Similarity-searching and clustering algorithms for processing databases of two-dimensional and three-dimensional chemical structures. In: Dean, P.M. (eds) Molecular Similarity in Drug Design. Springer, Dordrecht. https://doi.org/10.1007/978-94-011-1350-2_5
DOI: https://doi.org/10.1007/978-94-011-1350-2_5
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-010-4589-6
Online ISBN: 978-94-011-1350-2
eBook Packages: Springer Book Archive