# What Do We Know?: Simple Statistical Techniques that Help

Protocol

## Abstract

An understanding of simple statistical techniques is invaluable in science and in life. Despite this, and despite the sophistication many practitioners bring to the methods and algorithms of molecular modeling, statistical analysis in the field is rare and, where present, often unconvincing. I present here some basic approaches that have proved useful in my own work, along with examples drawn from the field. In particular, I give careful consideration to the statistics of virtual screening evaluations.

### Key words

Statistics, Central Limit Theorem, Variance, Standard deviation, Confidence limits, *p*-Values, Propagation of error, Error bars, *logit* transform, Virtual screening, ROC curves, AUC, Enrichment, Correlation, Student’s *t*-test, ANOVA

## Copyright information

© Humana Press 2011