Analysis and use of fragment-occurrence data in similarity-based virtual screening

  • Shereena M. Arif
  • John D. Holliday
  • Peter Willett


Current systems for similarity-based virtual screening use similarity measures in which all the fragments in a fingerprint contribute equally to the calculation of structural similarity. This paper discusses the weighting of fragments on the basis of their frequencies of occurrence in molecules. Extensive experiments with sets of active molecules from the MDL Drug Data Report and the World of Molecular Bioactivity databases, using fingerprints encoding Tripos holograms, Pipeline Pilot ECFC_4 circular substructures and Sunset Molecular keys, demonstrate clearly that frequency-based screening is generally more effective than conventional, unweighted screening. The results suggest that standardising the raw occurrence frequencies by taking the square root of the frequencies will maximise the effectiveness of virtual screening. An upper-bound analysis shows the complex interactions that can take place between representations, weighting schemes and similarity coefficients when similarity measures are computed, and provides a rationalisation of the relative performance of the various weighting schemes.
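To make the weighting idea concrete, the sketch below compares two toy fingerprints under binary (presence/absence), raw-count and square-root-standardised weights. It is a minimal illustration, not the authors' implementation: the fragment labels and counts are invented, the function names `sqrt_weights` and `tanimoto` are hypothetical, and the similarity is assumed to be the usual non-binary form of the Tanimoto coefficient, sum(x*y) / (sum(x^2) + sum(y^2) - sum(x*y)).

```python
from math import sqrt


def sqrt_weights(counts):
    """Standardise raw fragment occurrence counts by taking their square root."""
    return {frag: sqrt(n) for frag, n in counts.items() if n > 0}


def tanimoto(x, y):
    """Tanimoto coefficient for two sparse, non-negative weighted fingerprints.

    Assumed form: dot(x, y) / (|x|^2 + |y|^2 - dot(x, y)).
    With 0/1 weights this reduces to the familiar binary Tanimoto.
    """
    dot = sum(w * y[frag] for frag, w in x.items() if frag in y)
    xx = sum(w * w for w in x.values())
    yy = sum(w * w for w in y.values())
    denom = xx + yy - dot
    return dot / denom if denom else 0.0


# Hypothetical fragment -> occurrence-count fingerprints (illustrative only).
query = {"C-C": 4, "C=O": 1, "C-N": 1}
candidate = {"C-C": 1, "C=O": 1, "C-O": 2}

# Unweighted: every fragment that is present contributes equally.
binary = tanimoto({f: 1.0 for f in query}, {f: 1.0 for f in candidate})

# Raw occurrence frequencies used directly as weights.
raw = tanimoto({f: float(n) for f, n in query.items()},
               {f: float(n) for f, n in candidate.items()})

# Square-root standardisation damps the influence of very frequent fragments.
weighted = tanimoto(sqrt_weights(query), sqrt_weights(candidate))

print(f"binary={binary:.3f} raw={raw:.3f} sqrt={weighted:.3f}")
```

In this toy case the three schemes give noticeably different scores for the same pair of molecules, which is the kind of interaction between representation, weighting scheme and coefficient that the upper-bound analysis examines.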


Fingerprint; Fragment occurrences; Ligand-based virtual screening; Similarity searching; Substructural fragment; Tanimoto coefficient; Virtual screening; Weighting scheme



We thank Accelrys Software Inc., Sunset Molecular Discovery LLC, Symyx Technologies Inc. and Tripos Inc. for software and data, the Royal Society and the Wolfson Foundation for laboratory support, and the Government of Malaysia for funding.



Copyright information

© Springer Science+Business Media B.V. 2009

Authors and Affiliations

  • Shereena M. Arif
    • 1
  • John D. Holliday
    • 1
  • Peter Willett
    • 1
  1. Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, Sheffield, UK
