Abstract
An understanding of simple statistical techniques is invaluable in science and in life. Despite this, and despite the sophistication of many practitioners in the methods and algorithms of molecular modeling, statistical analysis is usually rare and often uncompelling. I present here some basic approaches that have proved useful in my own work, along with examples drawn from the field. In particular, the statistics of evaluations of virtual screening are carefully considered.
References
Loredo, T. J., From Laplace to Supernova SN 1987A: Bayesian inference in Astrophysics. Maximum Entropy and Bayesian Methods. P. F. Fougere (ed). Kluwer Academic, Netherlands: 1990, 81–142.
Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P., Numerical Recipes: The Art of Scientific Computing. 3rd ed.; Cambridge University Press, New York: 2007.
Wainer, H., The most dangerous equation: Ignorance of how sample size affects statistical variation has created havoc for nearly a millennium. Am. Sci. 2007, 95, 249–256.
Stigler, S. M., Statistics and the question of standards. J. Res. Natl. Inst. Stand. Technol. 1996, 101, 779–789.
Student, The probable error of a mean. Biometrika 1908, 6, 1–25.
DeLong, E. R.; DeLong, D. M.; Clarke-Pearson, D. L., Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988, 44, 837–845.
Cortes, C.; Mohri, M., Confidence intervals for the area under the ROC curve. Adv. Neural. Inf. Process. Syst. 2004, 17, 305–312.
Huang, N.; Shoichet, B. K.; Irwin, J. J., Benchmarking sets for molecular docking. J. Med. Chem. 2006, 49, 6789–6801.
Truchon, J.-F.; Bayly, C. I., Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. J. Chem. Inf. Model. 2007, 47, 488–508.
Jain, A. N., Surflex-Dock 2.1: Robust performance from ligand energetic modeling, ring flexibility, and knowledge-based search. J. Comput. Aided Mol. Des. 2007, 21, 281–306.
Skillman, A. G.; Nicholls, A., SAMPL2: Statistical Analysis of the Modeling of Proteins and Ligands: 2008.
Scargle, J. D., Publication bias: The “File-Drawer” problem in scientific inference. J. Sci. Explor. 2000, 14, 91–106.
Ziliak, S. T.; McCloskey, D. N., The Cult of Statistical Significance. The University of Michigan Press, USA: 2007.
Warren, G. L.; Andrews, C. W.; Capelli, A. M.; Clarke, B.; LaLonde, J.; Lambert, M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I. D.; Woolven, J. M.; Peishoff, C. E.; Head, M. S., A critical assessment of docking programs and scoring functions. J. Med. Chem. 2006, 49, 5912–5931.
Enyedy, I. J.; Egan, W. J., Can we use docking and scoring for hit-to-lead optimization? J. Comput. Aided Mol. Des. 2008, 22, 161–168.
Rerks-Ngarm, S.; Pitisuttithum, P.; Nitayaphan, S.; Kaewkungwal, J.; Chiu, J.; Paris, R.; Premsri, N.; Namwat, C.; de Souza, M.; Adams, E.; Benenson, M.; Gurunathan, S.; Tartaglia, J.; McNeil, J. G.; Francis, D. P.; Stablein, D.; Birx, D. L.; Chunsuttiwat, S.; Khamboonruang, C.; Thongcharoen, P.; Robb, M. L.; Michael, N. L.; Kunasol, P.; Kim, J. H., Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand. N. Engl. J. Med. 2009, 361, 2209–2220.
Welch, B. L., The generalization of “student’s” problem when several different population variances are involved. Biometrika 1947, 34, 28–35.
Satterthwaite, F. E., An approximate distribution of estimates of variance components. Biometrics Bull. 1946, 2, 110–114.
Glantz, S. A., How to detect, correct, and prevent errors in the medical literature. Circulation 1980, 61, 1–7.
Snedecor, G. W.; Cochran, W. G., Statistical Methods. 8th ed.; Blackwell Publishing, Malden, MA: 1989.
McGann, M. R.; Almond, H. R.; Nicholls, A.; Grant, J. A.; Brown, F. K., Gaussian docking functions. Biopolymers 2003, 68, 76–90.
Rush, T. S.; Grant, J. A.; Mosyak, L.; Nicholls, A., A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J. Med. Chem. 2005, 48, 1489–1495.
Glantz, S. A., Primer of Biostatistics. 5th ed.; McGraw-Hill, New York: 2002.
Kanji, G. K., 100 Statistical Tests. 3rd ed.; Sage Publications, London: 2006.
Bulmer, M. G., Principles of Statistics. Dover, USA: 1979.
Keeping, E. S., Introduction to Statistical Inference. Dover, USA: 1995.
van Belle, G., Statistical Rules of Thumb. Wiley, New York: 2002.
Pepe, M. S., The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press: 2004.
Good, P. I.; Hardin, J. W., Common Errors in Statistics (and How to Avoid Them). 2nd ed.; Wiley-InterScience, New Jersey: 2006.
Moye, L. A., Statistical Reasoning in Medicine. 2nd ed.; Springer, New York: 2006.
Sivia, D. S., Data Analysis: A Bayesian Tutorial. Oxford Science Publications: 1996.
Marin, J. -M.; Robert, C. P., Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer, New York: 2007.
Carlin, B. P.; Louis, T. A., Bayes and Empirical Bayes Methods for Data Analysis. 2nd ed.; Chapman & Hall/CRC, Boca Raton, FL: 2000.
Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G., Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280.
Vidal, D.; Thormann, M.; Pons, M., LINGO, an efficient holographic text based method to calculate biophysical properties and intermolecular similarities. J. Chem. Inf. Model. 2005, 45, 386–393.
Appendices
Appendix 1: Using logit to Get Error Bounds
In the section on enrichment as a metric for virtual screening we arrived at a formula for the variance of the enrichment.
Let’s assume that R, the ratio of inactives to actives, is very large, so we just have:

σ²(E) ≈ f_a(1 − f_a)/(N_a f_i²)
In our example, the ROC enrichment was fivefold at 1% inactives because the fraction of actives was 0.05. This would mean the 95% error would be:

δE(95%) ≈ 40.5/√N_a, i.e., bounds of 5.0 ± 40.5/√N_a
Now the enrichment is 5.0. If N_a < (40.5/5.0)², i.e., N_a < 65, then the lower error bound becomes negative, i.e., a nonsense value. The problem is, as mentioned in the text, that the quantity of interest, the fraction of actives f_a, is bounded between 0 and 1. However, if we transform with the logit function it becomes unbounded, just like the Gaussian.
If we make this transformation then the fraction f_a = 0.05 becomes:

logit(f_a) = ln(f_a/(1 − f_a)) = ln(0.05/0.95) = −2.944
This is our new mean. Now, we have to recalculate the variance in logit space, i.e.,

σ²(logit f_a) = σ²(f_a)/(f_a(1 − f_a))² = 1/(N_a f_a(1 − f_a))
This means the error bounds on the logit version of f_a become:

logit(f_a) ± 1.96/√(N_a f_a(1 − f_a)) = −2.944 ± 8.99/√N_a
Suppose we set N_a to a value much less than the “silly” threshold of 65. Let’s make it 25. In non-logit space this means the 95% range of enrichments is:

5.0 ± 40.5/√25 = 5.0 ± 8.1, i.e., [−3.1, 13.1]
Clearly, the lower range is nonsense. Now consider the range of f_a in logit space:

−2.944 ± 8.99/√25 = −2.944 ± 1.80, i.e., [−4.74, −1.15]
Now it is perfectly ok that the lower range is negative because logit functions go from negative to positive infinity. The final step, then, is to transform these values back to a fraction, using the inverse logit function, i.e.,

f = logit⁻¹(x) = 1/(1 + e^(−x))
And then divide by f_i to get the enrichment. If we do this we get:

[1/(1 + e^4.74), 1/(1 + e^1.15)]/0.01 ≈ [0.9, 24.1]
Clearly, these are large error bounds, error bounds that actually include an enrichment of less than random! However, they are not negative and they are a reflection of the difficulty of pinning the enrichment down with so few actives. Even if we repeat the analysis with four times as many actives, i.e., N_a = 100, the 95% range is still [2.1, 11.5]. The untransformed range for N_a = 100 is ~[1.0, 9.0].
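The whole recipe above is easy to script. The following sketch (ordinary Python, not from the chapter; the function names are mine) reproduces the N_a = 100 case:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def enrichment_bounds(f_a, f_i, n_a, z=1.96):
    """95% bounds on ROC enrichment via the logit transform."""
    center = logit(f_a)
    # variance of logit(f_a) = 1 / (N_a * f_a * (1 - f_a))
    sigma = 1.0 / math.sqrt(n_a * f_a * (1.0 - f_a))
    # transform the bounds back to fractions, then divide by f_i
    lo = inv_logit(center - z * sigma) / f_i
    hi = inv_logit(center + z * sigma) / f_i
    return lo, hi

lo, hi = enrichment_bounds(0.05, 0.01, 100)
print(round(lo, 1), round(hi, 1))  # 2.1 11.5
```

Note that the lower bound is always positive, however small N_a becomes, because the inverse logit maps the whole real line back into (0, 1).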
Appendix 2: Why Variances Add
Suppose we have two sources of error that can move the measured value away from its true mean, and let’s suppose that mean value is zero for simplicity. The CLT tells us that each source alone will produce a distribution of values according to the number of observations and the intrinsic variance of each source:

pdf_α(x) = (1/(σ_α√(2π))) e^(−x²/2σ_α²),  pdf_β(y) = (1/(σ_β√(2π))) e^(−y²/2σ_β²)
Now x and y are independent variations from the mean; therefore, the probability of observing an error of x from the first source and y from the second source has to be the joint probability, i.e.,

pdf_{α,β}(x, y) = pdf_α(x) pdf_β(y)
Now for such a combination of errors the total error is just (x + y). So what is the average square of the error, i.e., the variance, over all possible x and y? This is just the two dimensional averaging (i.e., integral) of (x + y)², weighted by pdf_{α,β}(x, y), i.e.,

σ²_total = ∫∫ (x + y)² pdf_{α,β}(x, y) dx dy
We can split this into three integrals by expanding (x + y)². Thus:

σ²_total = ∫∫ x² pdf_{α,β}(x, y) dx dy + 2∫∫ xy pdf_{α,β}(x, y) dx dy + ∫∫ y² pdf_{α,β}(x, y) dx dy
We can rewrite the first term as:

∫ x² pdf_α(x) [∫ pdf_β(y) dy] dx
Therefore, we can integrate the integral over y independently; it is just 1.0 because pdf_β is normalized. We can do the same thing for the third term for x. This leads to:

σ²_total = ∫ x² pdf_α(x) dx + 2[∫ x pdf_α(x) dx][∫ y pdf_β(y) dy] + ∫ y² pdf_β(y) dy
Now, given that the mean is zero, the first term is just the integral for the variance due to x, the third term is the integral for the variance due to y, and the second term must be zero because it separates into the product of two integrals each of which must be zero, as they calculate the average value of x and y, respectively, both zero. Therefore:

σ²_total = σ_α² + σ_β²
The astute reader will notice that we could have performed the same sequence of operations with any pdf, not just a Gaussian, and arrived at the same conclusion. The key steps are multiplying the individual pdfs together and separating the resultant integral into three integrals, two of which give the individual variances while the third must equal zero because we defined the mean of each pdf to be zero. That is, this is a general result, not one that pertains only to Gaussian forms of the distribution function.
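The result is easy to check numerically. A minimal sketch (plain Python, zero-mean assumption as above; one source is deliberately non-Gaussian to illustrate the generality):

```python
import random

random.seed(42)
n = 200_000
sigma_a = 1.5        # Gaussian error source, variance 2.25
half_width = 3.0     # uniform error source, variance = half_width**2 / 3 = 3.0

# Each observation is hit by both independent error sources.
total = [random.gauss(0.0, sigma_a) + random.uniform(-half_width, half_width)
         for _ in range(n)]

# Both means are zero, so the variance is just the mean square of the sum.
var_total = sum(t * t for t in total) / n
print(var_total)  # close to 2.25 + 3.0 = 5.25
```

The sampled variance of the summed error matches σ_α² + σ_β² even though the second pdf is uniform, not Gaussian.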
Appendix 3: Deriving the Hanley Formula for AUC Error
Recall that we have the following equation for the variance of either the actives or the inactives, where w = AUC for the former and w = 1 − AUC for the latter:

σ² = 2w²/(1 + w) − w²
The assumption by Hanley is that the pdf for both actives and inactives follows an exponential distribution, e.g.

pdf(x) = λ e^(−λx),  x ≥ 0
Here, x is a score for either active or inactive that determines its rank (higher = better). These forms integrate from 0 to positive infinity to 1.0 as required. Since we can always rescale x by a constant and still have the same rankings, let’s set the lambda for inactives to 1, i.e.,

pdf_inactive(x) = e^(−x),  pdf_active(x) = λ e^(−λx)
Given these probability density functions we can write down an expression for the AUC either in terms of the fraction of inactives of lower score than each active, or as the fraction of actives higher than each inactive. The math is a little cleaner if we do the latter:

AUC = ∫₀^∞ pdf_inactive(x) [∫ₓ^∞ pdf_active(y) dy] dx
The first term in the integral is the density of inactives at score x and this is multiplied by the fraction of actives with a score greater than x. If we substitute the Hanley pdfs we get:

AUC = ∫₀^∞ e^(−x) e^(−λx) dx = 1/(1 + λ)
We can see that this looks correct because if lambda is greater than one the scores of the actives must fall off more quickly than the inactives and the AUC will be less than 0.5, but if it is less than one it has a longer tail of positive scores and so has an AUC greater than 0.5. Now, let’s consider the variance for the inactives:

σ² = ∫₀^∞ e^(−x) (e^(−λx))² dx − (∫₀^∞ e^(−x) e^(−λx) dx)²
This is just the equivalent of ⟨p²⟩ − ⟨p⟩² we normally see for a variance, but we are integrating over the pdf for the inactives. Expanding and solving the integral we get:

σ² = 1/(1 + 2λ) − 1/(1 + λ)²
Now, the nice thing about the Hanley choice is that we can substitute for lambda from the AUC, i.e.,

λ = (1 − AUC)/AUC
Setting w = 1 − AUC, so that λ = w/(1 − w), we get:

σ² = (1 − w)/(1 + w) − (1 − w)² = 2w²/(1 + w) − w²
And the result required is obtained. A further nice thing about the Hanley pdf is that we can get a simple expression for the ROC curve. If we want to know what fraction, f, of inactives or actives have a score greater than z we have:

f(z)_inactive = e^(−z),  f(z)_active = e^(−λz)
But (f(z)_inactive, f(z)_active) are the points on the ROC curve, parameterized by z. Therefore, to express the one in terms of the other we simply have:

f(z)_active = (f(z)_inactive)^λ = (f(z)_inactive)^((1 − AUC)/AUC)
This is the form of the Hanley ROC curve for a given AUC value. It can be a pretty good fit to real data!
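The Hanley model is simple enough to check by simulation. This sketch (plain Python, variable names mine) draws exponential scores and verifies AUC = 1/(1 + λ), the ROC relation tpr = fpr^λ, and the algebraic identity for the variance:

```python
import random

random.seed(7)
lam = 0.5                     # actives decay more slowly, so AUC > 0.5
auc = 1.0 / (1.0 + lam)       # Hanley model prediction: AUC = 2/3

n = 50_000
actives = [random.expovariate(lam) for _ in range(n)]
inactives = [random.expovariate(1.0) for _ in range(n)]

# Empirical AUC: the chance a random active outscores a random inactive.
auc_mc = sum(1 for a, i in zip(actives, inactives) if a > i) / n

# A point on the ROC curve at threshold z should satisfy tpr = fpr**lam.
z = 1.0
fpr = sum(1 for s in inactives if s > z) / n   # analytically e**(-z)
tpr = sum(1 for s in actives if s > z) / n     # analytically e**(-lam*z)

# The solved integral should equal the closed form with w = 1 - AUC.
w = 1.0 - auc
var_integral = 1.0 / (1.0 + 2.0 * lam) - 1.0 / (1.0 + lam) ** 2
var_closed = 2.0 * w * w / (1.0 + w) - w * w
```

With λ = 0.5 both variance expressions come out to 1/18, and the simulated AUC and ROC point agree with the analytic values to sampling noise.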
Copyright information
© 2011 Humana Press
Cite this protocol
Nicholls, A. (2011). What Do We Know?: Simple Statistical Techniques that Help. In: Bajorath, J. (eds) Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology, vol 672. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-839-3_22
Print ISBN: 978-1-60761-838-6
Online ISBN: 978-1-60761-839-3