Automated Inference of Chemical Discriminants of Biological Activity

  • Sebastian Raschka
  • Anne M. Scott
  • Mar Huertas
  • Weiming Li
  • Leslie A. Kuhn
Part of the Methods in Molecular Biology book series (MIMB, volume 1762)


Ligand-based virtual screening has become a standard technique for the efficient discovery of bioactive small molecules. Following assays to determine the activity of compounds selected by virtual screening, or other approaches in which dozens to thousands of molecules have been tested, machine learning techniques make it straightforward to discover the patterns of chemical groups that correlate with the desired biological activity. Defining the chemical features that generate activity can be used to guide the selection of molecules for subsequent rounds of screening and assaying, as well as help design new, more active molecules for organic synthesis.

The quantitative structure–activity relationship machine learning protocols we describe here, using decision trees, random forests, and sequential feature selection, take as input the chemical structure of a single, known active small molecule (e.g., an inhibitor, agonist, or substrate) for comparison with the structure of each tested molecule. Knowledge of the atomic structure of the protein target and its interactions with the active compound is not required. These protocols can be modified and applied to any data set that consists of a series of measured structural, chemical, or other features for each tested molecule, along with the experimentally measured value of the response variable you would like to predict or optimize for your project, for instance, inhibitory activity in a biological assay or the free energy of binding (ΔG_binding). To illustrate the use of different machine learning algorithms, we step through the analysis of a data set of inhibitor candidates from virtual screening that were tested recently for their ability to inhibit GPCR-mediated signaling in a vertebrate.
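As a rough illustration of the sequential feature selection idea named above, the following minimal sketch runs sequential backward selection on a tiny, invented data set. It is not the chapter's implementation: the feature names, the data, and the helper functions `loo_accuracy` and `sequential_backward_selection` are all hypothetical, and a leave-one-out 1-nearest-neighbor classifier written in plain Python stands in for the decision tree or random forest models the protocols use.

```python
# Hypothetical toy data: each row is one tested molecule, each column a 0/1
# flag saying whether a chemical group of the reference active is matched.
FEATURE_NAMES = ["sulfate_match", "keto_match", "hydroxyl_match"]  # illustrative only
X = [
    [1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],  # assayed as active
    [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1],  # assayed as inactive
]
y = [1, 1, 1, 1, 0, 0, 0, 0]


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier that only
    looks at the columns listed in `features`."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if j == i:
                continue  # hold out the molecule being predicted
            d = sum((xi[f] - xj[f]) ** 2 for f in features)
            if d < best_d:  # ties keep the earliest neighbor
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def sequential_backward_selection(X, y, k):
    """Drop one feature at a time, always keeping the subset that scores
    best, until only `k` features remain."""
    features = tuple(range(len(X[0])))
    history = [(features, loo_accuracy(X, y, features))]
    while len(features) > k:
        candidates = [tuple(f for f in features if f != drop) for drop in features]
        scores = [loo_accuracy(X, y, c) for c in candidates]
        best = scores.index(max(scores))  # the first best subset wins ties
        features = candidates[best]
        history.append((features, scores[best]))
    return features, history


selected, history = sequential_backward_selection(X, y, k=1)
for subset, score in history:
    print([FEATURE_NAMES[f] for f in subset], round(score, 2))
```

In this toy data the activity label is determined entirely by the first column, so the selection ends with that single feature. With real assay data, the scoring function would normally be a cross-validated estimate from the chosen classifier rather than this simplified stand-in.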

Key words

Fingerprint analysis; GPCR; Invasive species control; Ligand-based screening; Machine learning; Pharmacophore; Quantitative structure–activity relationship; Random forest; Virtual screening







Abbreviations

  • 3kPZS: 3-keto petromyzonol sulfate
  • CAS: Chemical Abstracts Service Registry
  • CSD: Cambridge Structural Database
  • GPCR: G protein-coupled receptor
  • QSAR: Quantitative structure–activity relationship
  • SBS: Sequential backward selection
  • SFS: Sequential feature selection
  • VS: Virtual screening
  • ZINC12: Zinc Is Not Commercial database, version 12



Acknowledgments

This research was supported by funding from the Great Lakes Fishery Commission from 2012 to 2017 (Project ID: 2015_KUH_54031). We gratefully acknowledge OpenEye Scientific Software (Santa Fe, NM) for providing academic licenses for the use of their ROCS, Omega, QUACPAC (molcharge), and OEChem toolkit software. We also wish to express our special appreciation to the open source community for developing and sharing the freely accessible Python libraries for data processing, machine learning, and plotting that were used for the data analysis presented in this chapter.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Sebastian Raschka (1)
  • Anne M. Scott (2)
  • Mar Huertas (2, 3)
  • Weiming Li (2)
  • Leslie A. Kuhn (1, 2, 4)

  1. Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, USA
  2. Department of Fisheries and Wildlife, Michigan State University, East Lansing, USA
  3. Department of Biology, Texas State University, San Marcos, USA
  4. Department of Computer Science and Engineering, Michigan State University, East Lansing, USA
