Molecular discovery by optimal sequential search

  • Genyuan Li
Original Paper


Developing a new compound in chemistry or molecular biology, especially a new medicine in the pharmaceutical industry, often requires finding one or more candidate molecules with the best value of a desired property (e.g., binding affinity) in a large library of molecules that share the same scaffold but carry one of \(m\) distinct functional substituents at each of \(n\) different sites. The total number of molecules in this library is \(N_{\mathrm{lib}} = m^n\), which can be very large (e.g., millions). The task is challenging because synthesizing and testing all of these molecules is costly and often infeasible. A new algorithm, referred to as optimal sequential search, is developed to overcome this difficulty. The algorithm is chemically intuitive: it uses only information about molecular composition, which makes it accessible to practicing chemists. It can be applied to small, medium, and large molecule libraries. By synthesizing and measuring the property of only a limited number of molecules, it can effectively capture the best candidate molecules in the whole library. Three examples, with library sizes of 64, 160,000, and 1,048,576, are used for illustration. For the small library, syntheses and property measurements of 17 molecules are sufficient to capture the top 7 candidate molecules. For the medium and large libraries, syntheses and property measurements of about one thousand molecules capture most or a large fraction of the top 500 candidates, and in particular of the top 100. However, the algorithm requires many (e.g., hundreds of) iterative rounds of synthesis and property measurement, and the time cost may be unacceptable if these rounds are performed manually. Automating the sequential search process is therefore the next task in making the algorithm practical.
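The following minimal Python sketch illustrates the library-size arithmetic and the general shape of a measure, refit, select loop. It is not the author's algorithm: the keywords suggest an additive Gaussian process with an efficient-global-optimization criterion, whereas this sketch substitutes a crude additive main-effects estimate and purely greedy selection. The function names, the toy property, and the (m, n) pairs are all hypothetical. In particular, the source gives only the library sizes, not which m and n produce them.

```python
import itertools
import random

def build_library(m, n):
    """Enumerate all m**n molecules as tuples of substituent indices, one per site."""
    return list(itertools.product(range(m), repeat=n))

def fit_additive(measured, n):
    """Crude additive model (an assumption, standing in for an additive GP):
    overall mean plus, per site, the mean residual of each substituent seen so far."""
    mean = sum(measured.values()) / len(measured)
    sums = [dict() for _ in range(n)]
    counts = [dict() for _ in range(n)]
    for mol, y in measured.items():
        for site, sub in enumerate(mol):
            sums[site][sub] = sums[site].get(sub, 0.0) + (y - mean)
            counts[site][sub] = counts[site].get(sub, 0) + 1
    effects = [{s: sums[i][s] / counts[i][s] for s in sums[i]} for i in range(n)]
    return mean, effects

def predict(model, mol):
    """Predicted property; substituents never seen at a site contribute 0."""
    mean, effects = model
    return mean + sum(effects[i].get(s, 0.0) for i, s in enumerate(mol))

def sequential_search(lib, measure, n, n_init, budget, seed=0):
    """Measure a random initial design, then repeatedly refit the model and
    measure the unmeasured molecule with the highest predicted property."""
    rng = random.Random(seed)
    budget = min(budget, len(lib))
    measured = {mol: measure(mol) for mol in rng.sample(lib, n_init)}
    while len(measured) < budget:
        model = fit_additive(measured, n)
        candidates = [mol for mol in lib if mol not in measured]
        best = max(candidates, key=lambda mol: predict(model, mol))
        measured[best] = measure(best)  # stands in for one synthesis + assay
    return measured

# Hypothetical (m, n) pairs whose m**n match the library sizes quoted above.
for m_, n_ in [(4, 3), (20, 4), (4, 10)]:
    print(m_, n_, m_ ** n_)

# Toy run on a 64-molecule library with a hidden, purely additive property.
m, n = 4, 3
lib = build_library(m, n)
toy_rng = random.Random(1)
site_effect = [[toy_rng.gauss(0, 1) for _ in range(m)] for _ in range(n)]

def toy_property(mol):
    """Hypothetical ground truth: sum of per-site substituent contributions."""
    return sum(site_effect[i][s] for i, s in enumerate(mol))

measured = sequential_search(lib, toy_property, n, n_init=8, budget=17)
top7 = sorted(lib, key=toy_property, reverse=True)[:7]
captured = sum(mol in measured for mol in top7)
print(len(lib), len(measured), captured)
```

The budget of 17 measurements and the top-7 check only echo the numbers quoted for the small library; the toy property is synthetic, so the capture count printed here says nothing about the results reported in the paper. A closer analogue of the described method would replace the greedy choice with an acquisition criterion that also accounts for model uncertainty.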


Keywords: Molecular discovery; Drug discovery; Optimal sequential search; Additive Gaussian process; Efficient global optimization; High throughput screening; Parallel synthesis; Tyrosine derived biodegradable polymer; Protein G domain B1; Oligonucleotides


Supplementary material

Supplementary material 1: 10910_2019_1062_MOESM1_ESM.txt (txt, 8 KB)
Supplementary material 2: 10910_2019_1062_MOESM2_ESM.xls (xls, 4816 KB)



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Chemistry, Princeton University, Princeton, USA
