Abstract
An understanding of simple statistical techniques is invaluable in science and in life. Despite this, and despite the sophistication of many practitioners in the methods and algorithms of molecular modeling, statistical analysis is usually rare and often uncompelling. I present here some basic approaches that have proved useful in my own work, along with examples drawn from the field. In particular, the statistics of evaluations of virtual screening are carefully considered.
References
Loredo, T. J., From Laplace to Supernova SN 1987A: Bayesian inference in Astrophysics. Maximum Entropy and Bayesian Methods. P. F. Fougere (ed). Kluwer Academic, Netherlands: 1990, 81–142.
Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P., Numerical Recipes: The Art of Scientific Computing. 3rd ed.; Cambridge University Press, New York: 2007.
Wainer, H., The most dangerous equation: Ignorance of how sample size affects statistical variation has created havoc for nearly a millennium. Am. Sci. 2007, 95, 249–256.
Stigler, S. M., Statistics and the question of standards. J. Res. Natl. Inst. Stand. Technol. 1996, 101, 779–789.
Student, The probable error of a mean. Biometrika 1908, 6, 1–25.
DeLong, E. R.; DeLong, D. M.; Clarke-Pearson, D. L., Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988, 44, 837–845.
Cortes, C.; Mohri, M., Confidence intervals for the area under the ROC curve. Adv. Neural. Inf. Process. Syst. 2004, 17, 305–312.
Huang, N.; Shoichet, B. K.; Irwin, J. J., Benchmarking sets for molecular docking. J. Med. Chem. 2006, 49, 6789–6801.
Truchon, J.-F.; Bayly, C. I., Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. J. Chem. Inf. Model. 2007, 47, 488–508.
Jain, A. N., Surflex-Dock 2.1: Robust performance from ligand energetic modeling, ring flexibility, and knowledge-based search. J. Comput. Aided Mol. Des. 2007, 21, 281–306.
Skillman, A. G.; Nicholls, A., SAMPL2: Statistical Analysis of the Modeling of Proteins and Ligands: 2008.
Scargle, J. D., Publication bias: The “File-Drawer” problem in scientific inference. J. Sci. Explor. 2000, 14, 91–106.
Ziliak, S. T.; McCloskey, D. N., The Cult of Statistical Significance. The University of Michigan Press, USA: 2007.
Warren, G. L.; Andrews, C. W.; Capelli, A. M.; Clarke, B.; LaLonde, J.; Lambert, M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I. D.; Woolven, J. M.; Peishoff, C. E.; Head, M. S., A critical assessment of docking programs and scoring functions. J. Med. Chem. 2006, 49, 5912–5931.
Enyedy, I. J.; Egan, W. J., Can we use docking and scoring for hit-to-lead optimization? J. Comput. Aided Mol. Des. 2008, 22, 161–168.
Rerks-Ngarm, S.; Pitisuttithum, P.; Nitayaphan, S.; Kaewkungwal, J.; Chiu, J.; Paris, R.; Premsri, N.; Namwat, C.; de Souza, M.; Adams, E.; Benenson, M.; Gurunathan, S.; Tartaglia, J.; McNeil, J. G.; Francis, D. P.; Stablein, D.; Birx, D. L.; Chunsuttiwat, S.; Khamboonruang, C.; Thongcharoen, P.; Robb, M. L.; Michael, N. L.; Kunasol, P.; Kim, J. H., Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand. N. Engl. J. Med. 2009, 361, 2209–2220.
Welch, B. L., The generalization of “student’s” problem when several different population variances are involved. Biometrika 1947, 34, 28–35.
Satterthwaite, F. E., An approximate distribution of estimates of variance components. Biometrics Bull. 1946, 2, 110–114.
Glantz, S. A., How to detect, correct, and prevent errors in the medical literature. Circulation 1980, 61, 1–7.
Snedecor, G. W.; Cochran, W. G., Statistical Methods. 8th ed.; Blackwell Publishing, Malden, MA: 1989.
McGann, M. R.; Almond, H. R.; Nicholls, A.; Grant, J. A.; Brown, F. K., Gaussian docking functions. Biopolymers 2003, 68, 76–90.
Rush, T. S.; Grant, J. A.; Mosyak, L.; Nicholls, A., A shape-based 3-D scaffold hopping method and its application to a bacterial protein-protein interaction. J. Med. Chem. 2005, 48, 1489–1495.
Glantz, S. A., Primer of Biostatistics. 5th ed.; McGraw-Hill, New York: 2002.
Kanji, G. K., 100 Statistical Tests. 3rd ed.; Sage Publications, London: 2006.
Bulmer, M. G., Principles of Statistics. Dover, USA: 1979.
Keeping, E. S., Introduction to Statistical Inference. Dover, USA: 1995.
van Belle, G., Statistical Rules of Thumb. Wiley, New York: 2002.
Pepe, M. S., The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press: 2004.
Good, P. I.; Hardin, J. W., Common Errors in Statistics (and How to Avoid Them). 2nd ed.; Wiley-InterScience, New Jersey: 2006.
Moye, L. A., Statistical Reasoning in Medicine. 2nd ed.; Springer, New York: 2006.
Sivia, D. S., Data Analysis: A Bayesian Tutorial. Oxford Science Publications: 1996.
Marin, J. -M.; Robert, C. P., Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer, New York: 2007.
Carlin, B. P.; Louis, T. A., Bayes and Empirical Bayes Methods for Data Analysis. 2nd ed.; Chapman & Hall/CRC, Boca Raton, FL: 2000.
Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G., Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280.
Vidal, D.; Thormann, M.; Pons, M., LINGO, an efficient holographic text based method to calculate biophysical properties and intermolecular similarities. J. Chem. Inf. Model. 2005, 45, 386–393.
Appendices
Appendix 1: Using logit to Get Error Bounds
In the section on enrichment as a metric for virtual screening we arrived at a formula for the variance of the enrichment.
Let’s assume that R, the ratio of inactives to actives, is very large, so we just have:

σ²(E) ≈ f_a(1 − f_a)/(N_a f_i²)
In our example, the ROC enrichment was fivefold at 1% inactives because the fraction of actives was 0.05. This would mean the 95% error would be:

δE(95%) ≈ 40.5/√N_a, i.e., bounds of 5.0 ± 40.5/√N_a
Now the enrichment is 5.0. If N_a < (40.5/5.0)², i.e., N_a < 65, then the lower error bound becomes negative, i.e., a nonsense value. The problem is, as mentioned in the text, that the quantity of interest, the fraction of actives f_a, is bounded between 0 and 1. However, if we transform with the logit function it becomes unbounded, just like the Gaussian.
If we make this transformation then the fraction f_a = 0.05 becomes:

logit(f_a) = ln(f_a/(1 − f_a)) = ln(0.05/0.95) = −2.944
This is our new mean. Now, we have to recalculate the variance in logit space, i.e.,

σ²(logit f_a) = σ²(f_a)/(f_a(1 − f_a))² = 1/(N_a f_a(1 − f_a))
This means the error bounds on the logit version of f_a become:

logit(f_a) ± 1.96/√(N_a f_a(1 − f_a)) = −2.944 ± 8.99/√N_a
Suppose we set N_a to a value much less than the “silly” threshold of 65. Let’s make it 25. In non-logit space this means the 95% range of enrichments is:

5.0 ± 40.5/√25 = 5.0 ± 8.1, i.e., [−3.1, 13.1]
Clearly, the lower range is nonsense. Now consider the range of f_a in logit space:

−2.944 ± 8.99/√25 = −2.944 ± 1.80, i.e., [−4.74, −1.15]
Now it is perfectly ok that the lower range is negative because logit functions go from negative to positive infinity. The final step, then, is to transform these values back to a fraction, using the inverse logit function, i.e.,

f = logit⁻¹(x) = 1/(1 + e^(−x))
And then divide by f_i to get the enrichment. If we do this we get:

[1/(1 + e^4.74), 1/(1 + e^1.15)]/0.01 ≈ [0.9, 24.1]
Clearly, these are large error bounds, error bounds that actually include an enrichment of less than random! However, they are not negative and they are a reflection of the difficulty of pinning the enrichment down with so few actives. Even if we repeat the analysis with four times as many actives, i.e., N_a = 100, the 95% range is still [2.1, 11.5]. The untransformed range for N_a = 100 is ~[1.0, 9.0].
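The whole recipe above is easy to script. The following sketch (ordinary Python, not from the chapter; the function names are mine) reproduces the N_a = 100 case:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def enrichment_bounds(f_a, f_i, n_a, z=1.96):
    """95% bounds on ROC enrichment via the logit transform."""
    center = logit(f_a)
    # variance of logit(f_a) = 1 / (N_a * f_a * (1 - f_a))
    sigma = 1.0 / math.sqrt(n_a * f_a * (1.0 - f_a))
    # transform the bounds back to fractions, then divide by f_i
    lo = inv_logit(center - z * sigma) / f_i
    hi = inv_logit(center + z * sigma) / f_i
    return lo, hi

lo, hi = enrichment_bounds(0.05, 0.01, 100)
print(round(lo, 1), round(hi, 1))  # 2.1 11.5
```

Note that the lower bound is always positive, however small N_a becomes, because the inverse logit maps the whole real line back into (0, 1).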
Appendix 2: Why Variances Add
Suppose we have two sources of error that can move the measured value away from its true mean, and let’s suppose that mean value is zero for simplicity. The CLT tells us that each source alone will produce a distribution of values according to the number of observations and the intrinsic variance of each source:

pdf_α(x) = (1/(σ_α√(2π))) e^(−x²/2σ_α²),  pdf_β(y) = (1/(σ_β√(2π))) e^(−y²/2σ_β²)
Now x and y are independent variations from the mean; therefore, the probability of observing an error of x from the first source and y from the second source has to be the joint probability, i.e.,

pdf_{α,β}(x, y) = pdf_α(x) pdf_β(y)
Now for such a combination of errors the total error is just (x + y). So what is the average square of the error, i.e., the variance, over all possible x and y? This is just the two dimensional averaging (i.e., integral) of (x + y)², weighted by pdf_{α,β}(x, y), i.e.,

σ²_total = ∫∫ (x + y)² pdf_{α,β}(x, y) dx dy
We can split this into three integrals by expanding (x + y)². Thus:

σ²_total = ∫∫ x² pdf_{α,β}(x, y) dx dy + 2∫∫ xy pdf_{α,β}(x, y) dx dy + ∫∫ y² pdf_{α,β}(x, y) dx dy
We can rewrite the first term as:

∫ x² pdf_α(x) [∫ pdf_β(y) dy] dx
Therefore, we can integrate the integral over y independently; it is just 1.0 because pdf_β is normalized. We can do the same thing for the third term for x. This leads to:

σ²_total = ∫ x² pdf_α(x) dx + 2[∫ x pdf_α(x) dx][∫ y pdf_β(y) dy] + ∫ y² pdf_β(y) dy
Now, given that the mean is zero, the first term is just the integral for the variance due to x, the third term is the integral for the variance due to y, and the second term must be zero because it separates into the product of two integrals each of which must be zero, as they calculate the average value of x and y, respectively, both zero. Therefore:

σ²_total = σ_α² + σ_β²
The astute reader will notice that we could have performed the same sequence of operations with any pdf, not just a Gaussian, and arrived at the same conclusion. The key steps are multiplying the individual pdfs together and separating the resultant integral into three integrals, two of which give the individual variances while the third must equal zero because we defined the mean of each pdf to be zero. That is, this is a general result, not one that pertains only to Gaussian forms of the distribution function.
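The result is easy to check numerically. A minimal sketch (plain Python, zero-mean assumption as above; one source is deliberately non-Gaussian to illustrate the generality):

```python
import random

random.seed(42)
n = 200_000
sigma_a = 1.5        # Gaussian error source, variance 2.25
half_width = 3.0     # uniform error source, variance = half_width**2 / 3 = 3.0

# Each observation is hit by both independent error sources.
total = [random.gauss(0.0, sigma_a) + random.uniform(-half_width, half_width)
         for _ in range(n)]

# Both means are zero, so the variance is just the mean square of the sum.
var_total = sum(t * t for t in total) / n
print(var_total)  # close to 2.25 + 3.0 = 5.25
```

The sampled variance of the summed error matches σ_α² + σ_β² even though the second pdf is uniform, not Gaussian.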
Appendix 3: Deriving the Hanley Formula for AUC Error
Recall that we have the following equation for the variance of either the actives or the inactives, where w = AUC for the former and w = 1 − AUC for the latter:

σ² = 2w²/(1 + w) − w²
The assumption by Hanley is that the pdf for both actives and inactives follows an exponential distribution, e.g.

pdf(x) = λ e^(−λx),  x ≥ 0
Here, x is a score for either active or inactive that determines its rank (higher = better). These forms integrate from 0 to positive infinity to 1.0 as required. Since we can always rescale x by a constant and still have the same rankings, let’s set the lambda for inactives to 1, i.e.,

pdf_inactive(x) = e^(−x),  pdf_active(x) = λ e^(−λx)
Given these probability density functions we can write down an expression for the AUC either in terms of the fraction of inactives of lower score than each active, or as the fraction of actives higher than each inactive. The math is a little cleaner if we do the latter:

AUC = ∫₀^∞ pdf_inactive(x) [∫ₓ^∞ pdf_active(y) dy] dx
The first term in the integral is the density of inactives at score x and this is multiplied by the fraction of actives with a score greater than x. If we substitute the Hanley pdfs we get:

AUC = ∫₀^∞ e^(−x) e^(−λx) dx = 1/(1 + λ)
We can see that this looks correct because if lambda is greater than one the scores of the actives must fall off more quickly than the inactives and the AUC will be less than 0.5, but if it is less than one it has a longer tail of positive scores and so has an AUC greater than 0.5. Now, let’s consider the variance for the inactives:

σ² = ∫₀^∞ e^(−x) (e^(−λx))² dx − (∫₀^∞ e^(−x) e^(−λx) dx)²
This is just the equivalent of ⟨p²⟩ − ⟨p⟩² we normally see for a variance, but we are integrating over the pdf for the inactives. Expanding and solving the integral we get:

σ² = 1/(1 + 2λ) − 1/(1 + λ)²
Now, the nice thing about the Hanley choice is that we can substitute for lambda from the AUC, i.e.,

λ = (1 − AUC)/AUC
Setting w = 1 − AUC, so that λ = w/(1 − w), we get:

σ² = (1 − w)/(1 + w) − (1 − w)² = 2w²/(1 + w) − w²
And the result required is obtained. A further nice thing about the Hanley pdf is that we can get a simple expression for the ROC curve. If we want to know what fraction, f, of inactives or actives have a score greater than z we have:

f(z)_inactive = e^(−z),  f(z)_active = e^(−λz)
But (f(z)_inactive, f(z)_active) are the points on the ROC curve, parameterized by z. Therefore, to express the one in terms of the other we simply have:

f(z)_active = (f(z)_inactive)^λ = (f(z)_inactive)^((1 − AUC)/AUC)
This is the form of the Hanley ROC curve for a given AUC value. It can be a pretty good fit to real data!
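The Hanley model is simple enough to check by simulation. This sketch (plain Python, variable names mine) draws exponential scores and verifies AUC = 1/(1 + λ), the ROC relation tpr = fpr^λ, and the algebraic identity for the variance:

```python
import random

random.seed(7)
lam = 0.5                     # actives decay more slowly, so AUC > 0.5
auc = 1.0 / (1.0 + lam)       # Hanley model prediction: AUC = 2/3

n = 50_000
actives = [random.expovariate(lam) for _ in range(n)]
inactives = [random.expovariate(1.0) for _ in range(n)]

# Empirical AUC: the chance a random active outscores a random inactive.
auc_mc = sum(1 for a, i in zip(actives, inactives) if a > i) / n

# A point on the ROC curve at threshold z should satisfy tpr = fpr**lam.
z = 1.0
fpr = sum(1 for s in inactives if s > z) / n   # analytically e**(-z)
tpr = sum(1 for s in actives if s > z) / n     # analytically e**(-lam*z)

# The solved integral should equal the closed form with w = 1 - AUC.
w = 1.0 - auc
var_integral = 1.0 / (1.0 + 2.0 * lam) - 1.0 / (1.0 + lam) ** 2
var_closed = 2.0 * w * w / (1.0 + w) - w * w
```

With λ = 0.5 both variance expressions come out to 1/18, and the simulated AUC and ROC point agree with the analytic values to sampling noise.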
Copyright information
© 2011 Humana Press
Cite this protocol
Nicholls, A. (2011). What Do We Know?: Simple Statistical Techniques that Help. In: Bajorath, J. (eds) Chemoinformatics and Computational Chemical Biology. Methods in Molecular Biology, vol 672. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-60761-839-3_22
Print ISBN: 978-1-60761-838-6
Online ISBN: 978-1-60761-839-3