Abstract
The concepts of molecular diversity and similarity are widely used in quantitative methods for designing (selecting) a representative set of molecules and for analyzing the relationship between chemical structure and biological activity. We review methods and algorithms for the design of a diverse set of molecules in chemical space using clustering, cell-based partitioning, or other distance-based approaches. Analogous cell-based and clustering methods are described for analyzing drug-discovery data to predict activity in virtual screening. We compare the performance of these methods, and the comparison also covers the choice of descriptor variables used to characterize chemical structure. We find that the diversity of a selected set is quite sensitive to both the statistical selection method and the choice of molecular descriptors and that, for the dataset used in this study, random selection works surprisingly well in providing a set of data for analysis.
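As a rough illustration of the distance-based and cell-based selection ideas named in the abstract, the sketch below picks a subset of a toy "library" in three ways: a greedy maximin rule, a one-point-per-cell rule, and plain random sampling. It is not the chapter's implementation. The function names, the two uniform random descriptors, the assumption that descriptors are scaled to [0, 1), and the use of the smallest pairwise Euclidean distance as a diversity score are all choices made here for illustration.

```python
import math
import random


def greedy_maximin(points, k):
    """Return indices of k points chosen greedily to be far from one another."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Assumption: seed the design with the point nearest the centroid.
    start = min(range(len(points)), key=lambda i: math.dist(points[i], centroid))
    selected = [start]
    # nearest[i] holds the distance from point i to its closest selected point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Add the candidate whose closest selected neighbour is farthest away.
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected


def cell_based(points, bins):
    """Return the index of the first point found in each occupied cell.

    Assumes every descriptor lies in [0, 1); each axis is cut into
    `bins` equal-width intervals, so a cell is a tuple of bin numbers.
    """
    seen = {}
    for i, p in enumerate(points):
        cell = tuple(min(int(x * bins), bins - 1) for x in p)
        seen.setdefault(cell, i)
    return sorted(seen.values())


def min_pairwise_distance(points):
    """Smallest distance between any two points (larger means more spread out)."""
    return min(
        math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]
    )


# Toy library: 200 hypothetical molecules described by two descriptors.
rng = random.Random(0)
library = [(rng.random(), rng.random()) for _ in range(200)]

picks = {
    "maximin": greedy_maximin(library, 10),
    "cell-based": cell_based(library, 3),
    "random": rng.sample(range(len(library)), 10),
}
for name, idx in picks.items():
    score = min_pairwise_distance([library[i] for i in idx])
    print(f"{name:10s} n={len(idx):2d} min distance={score:.3f}")
```

Changing the descriptors, the number of bins, or the diversity score changes which subset looks most diverse, which is the kind of sensitivity the abstract refers to. The toy data say nothing about how the methods rank on real screening data.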
Copyright information
© 2004 Humana Press Inc.
About this protocol
Cite this protocol
Lam, R.L., Welch, W.J. (2004). Comparison of Methods Based on Diversity and Similarity for Molecule Selection and the Analysis of Drug Discovery Data. In: Bajorath, J. (ed.) Chemoinformatics. Methods in Molecular Biology™, vol 275. Humana Press. https://doi.org/10.1385/1-59259-802-1:301
DOI: https://doi.org/10.1385/1-59259-802-1:301
Publisher Name: Humana Press
Print ISBN: 978-1-58829-261-2
Online ISBN: 978-1-59259-802-1
eBook Packages: Springer Protocols