Selection of Informative Examples in Chemogenomic Datasets

  • Daniel Reker
  • J. B. Brown
Part of the Methods in Molecular Biology book series (MIMB, volume 1825)


High-throughput and high-content screening campaigns have resulted in the creation of large chemogenomic matrices. These matrices form the training data used to build ligand–target interaction models for pharmacological and chemical biology research. While academic, government, and industrial efforts continuously add to the ligand–target data pairs available for modeling, major research effort is devoted to improving machine learning techniques to cope with the sparseness, heterogeneity, and size of available datasets, as well as their inherent noise and bias. This "arms race" has led to algorithms that generate highly complex models with high prediction performance, at the cost of training efficiency and interpretability.

In contrast, recent studies have challenged the necessity of "big data" in chemogenomic modeling and found that models built on larger numbers of examples do not necessarily have better predictive ability. Automated adaptive selection of the training data (ligand–target instances) used for model creation can yield considerably smaller training sets whose prediction performance is on par with training on hundreds of thousands of data points. In this chapter, we describe the protocols for one such iterative chemogenomic selection technique, including model construction and update, as well as possible techniques for evaluating the constructed models and analyzing the iterative selection process.
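The adaptive selection cycle summarized above (train a model on a small labeled set, query the most informative unmeasured ligand–target pair, add its measurement, and retrain) can be sketched in a few lines of Python. This is an illustrative toy only: the `featurize` descriptor, the k-nearest-neighbor surrogate model, and the synthetic `oracle` below are hypothetical stand-ins and not the random-forest-based protocol detailed in the chapter.

```python
import random

def featurize(ligand, target):
    # Hypothetical numeric descriptor for a ligand-target pair.
    return (ligand % 7, ligand % 3, target % 5)

def distance(a, b):
    # Squared Euclidean distance between two descriptors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_proba(train, pair, k=3):
    # Toy surrogate model: fraction of the k nearest labeled
    # pairs (by descriptor distance) that are active.
    neighbors = sorted(
        train, key=lambda item: distance(featurize(*item[0]), featurize(*pair))
    )[:k]
    return sum(label for _, label in neighbors) / k

def most_uncertain(train, pool):
    # Uncertainty sampling: the pair whose predicted activity
    # probability is closest to 0.5 is the most informative.
    return min(pool, key=lambda pair: abs(predict_proba(train, pair) - 0.5))

random.seed(0)
# Synthetic "chemogenomic matrix": (ligand, target) -> active/inactive.
oracle = {(l, t): int((l + t) % 2 == 0) for l in range(20) for t in range(10)}
pool = list(oracle)
random.shuffle(pool)

# Small randomly chosen seed set; the rest stays in the unlabeled pool.
train = [(p, oracle[p]) for p in pool[:5]]
pool = pool[5:]

for _ in range(10):  # iterative selection loop
    pair = most_uncertain(train, pool)
    pool.remove(pair)
    train.append((pair, oracle[pair]))  # "measure" the selected pair

print(len(train))  # prints 15
```

Uncertainty sampling is only one possible query strategy; the same loop structure accommodates other selectors (e.g., expected model change or diversity-based picks) by swapping out `most_uncertain`.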

Key words

Active learning, Subset selection, Model complexity, Data mining, Sampling



Acknowledgments

The authors would like to thank Prof. Dr. Gisbert Schneider (ETH Zurich) and Dr. Petra Schneider for consultation and assistance during the development of the chemogenomic active-learning technique and related analyses. An academic license for use of the OpenEye chemoinformatics libraries is kindly acknowledged. J.B. Brown wishes to express thanks to Kyoto University (Ishizue Research Development Program) and the Japan Society for the Promotion of Science (grants 25870336, JP16H06306, 17K20043, Core-to-Core A) for resources that contributed to the development of the methodology. Daniel Reker is grateful for support from the Swiss National Science Foundation (grants P2EZP3_168827 and P300P2_177833).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
  2. Life Science Informatics Research Unit, Laboratory of Molecular Biosciences, Kyoto University Graduate School of Medicine, Kyoto, Japan