What Do We Know?: Simple Statistical Techniques that Help

  • Anthony Nicholls
Part of the Methods in Molecular Biology book series (MIMB, volume 672)


An understanding of simple statistical techniques is invaluable in science and in life. Despite this, and despite the sophistication many bring to the methods and algorithms of molecular modeling, statistical analysis is rare and, where it does appear, often unconvincing. I present here some basic approaches that have proved useful in my own work, along with examples drawn from the field. In particular, I consider carefully the statistics of evaluating virtual screening.

Key words

Statistics; Central Limit Theorem; Variance; Standard deviation; Confidence limits; p-Values; Propagation of error; Error bars; logit transform; Virtual screening; ROC curves; AUC; Enrichment; Correlation; Student’s t-test; ANOVA


Copyright information

© Humana Press 2011

Authors and Affiliations

  • Anthony Nicholls, OpenEye Scientific Software, Santa Fe, USA
