Abstract
Ligand-based virtual screening has become a standard technique for the efficient discovery of bioactive small molecules. Following assays to determine the activity of compounds selected by virtual screening, or other approaches in which dozens to thousands of molecules have been tested, machine learning techniques make it straightforward to discover the patterns of chemical groups that correlate with the desired biological activity. Defining the chemical features that generate activity can be used to guide the selection of molecules for subsequent rounds of screening and assaying, as well as help design new, more active molecules for organic synthesis.
The quantitative structure–activity relationship (QSAR) machine learning protocols we describe here, using decision trees, random forests, and sequential feature selection, take as input the chemical structure of a single, known active small molecule (e.g., an inhibitor, agonist, or substrate) for comparison with the structure of each tested molecule. Knowledge of the atomic structure of the protein target and its interactions with the active compound is not required. These protocols can be modified and applied to any dataset that consists of a series of measured structural, chemical, or other features for each tested molecule, along with the experimentally measured value of the response variable you would like to predict or optimize for your project, for instance, inhibitory activity in a biological assay or the free energy of binding (ΔG_binding). To illustrate the use of different machine learning algorithms, we step through the analysis of a dataset of inhibitor candidates from virtual screening that were recently tested for their ability to inhibit GPCR-mediated signaling in a vertebrate.
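As a rough illustration of the sequential feature selection idea mentioned above, the following minimal sketch implements sequential backward selection (SBS) from scratch using only the Python standard library. It is not the implementation used in the protocol. The toy dataset, the function names (`make_toy_data`, `loo_1nn_accuracy`, `sequential_backward_selection`), and the choice of a leave-one-out 1-nearest-neighbor score are all assumptions made for this example: feature 0 is constructed to separate "active" from "inactive" molecules, while features 1 to 3 are constructed to carry no activity signal.

```python
def make_toy_data(n_pairs=20):
    """Build a hypothetical, deterministic dataset of 2 * n_pairs molecules.

    Feature 0 separates the classes (0.1 for inactive, 0.9 for active).
    Features 1-3 are identical within each active/inactive pair, so they
    cannot distinguish activity on their own.
    """
    X, y = [], []
    for k in range(n_pairs):
        noise = [
            0.3 * k / (n_pairs - 1),
            0.3 * ((7 * k) % n_pairs) / (n_pairs - 1),
            0.3 * ((3 * k) % n_pairs) / (n_pairs - 1),
        ]
        for label in (0, 1):  # 0 = inactive, 1 = active
            X.append([0.1 + 0.8 * label] + noise)
            y.append(label)
    return X, y


def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier
    that only looks at the feature indices in `features`."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue  # hold out the molecule being predicted
            d = sum((xi[f] - xj[f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def sequential_backward_selection(X, y, k_features, score=loo_1nn_accuracy):
    """Start from all features and greedily remove one feature per round,
    keeping the reduced subset with the best score, until k_features remain."""
    selected = list(range(len(X[0])))
    history = [(tuple(selected), score(X, y, selected))]
    while len(selected) > k_features:
        candidates = [[f for f in selected if f != drop] for drop in selected]
        scored = [(score(X, y, c), c) for c in candidates]
        best_score, selected = max(scored, key=lambda t: t[0])
        history.append((tuple(selected), best_score))
    return selected, history


X, y = make_toy_data()
selected, history = sequential_backward_selection(X, y, k_features=1)
for subset, acc in history:
    print(f"features {subset}: leave-one-out accuracy = {acc:.2f}")
print("selected feature indices:", selected)
```

On real assay data, the scoring function would typically be a cross-validated estimate from the model of interest (for example, a decision tree or random forest), and the feature subset with the best trade-off between size and performance would be inspected rather than simply the smallest one.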
Abbreviations
- 2D: Two-dimensional
- 3D: Three-dimensional
- 3kPZS: 3-keto petromyzonol sulfate
- CAS: Chemical Abstracts Service Registry
- CSD: Cambridge Structural Database
- DKPES: 3,12-diketo-4,6-petromyzonene-24-sulfate
- EOG: Electro-olfactogram
- GPCR: G protein-coupled receptor
- QSAR: Quantitative structure–activity relationship
- SBS: Sequential backward selection
- SFS: Sequential feature selection
- VS: Virtual screening
- ZINC12: Zinc Is Not Commercial database, version 12
Acknowledgments
This research was supported by funding from the Great Lakes Fishery Commission from 2012 to 2017 (Project ID: 2015_KUH_54031). We gratefully acknowledge OpenEye Scientific Software (Santa Fe, NM) for providing academic licenses for the use of their ROCS, Omega, QUACPAC (molcharge), and OEChem toolkit software. We also wish to express our special appreciation to the open source community for developing and sharing the freely accessible Python libraries for data processing, machine learning, and plotting that were used for the data analysis presented in this chapter.
Copyright information
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Raschka, S., Scott, A.M., Huertas, M., Li, W., Kuhn, L.A. (2018). Automated Inference of Chemical Discriminants of Biological Activity. In: Gore, M., Jagtap, U. (eds) Computational Drug Discovery and Design. Methods in Molecular Biology, vol 1762. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7756-7_16
DOI: https://doi.org/10.1007/978-1-4939-7756-7_16
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-7755-0
Online ISBN: 978-1-4939-7756-7
eBook Packages: Springer Protocols