Abstract
This chapter discusses the basic principles of classical statistical significance testing (Sect. 1.1) and defines some well-known probability distributions that are necessary for discussing parametric significance tests (Sect. 1.2). (“A problem is parametric if the form of the underlying distribution is known, and it is nonparametric if we have no knowledge concerning the distribution(s) from which the observations are drawn.” Good (Permutation, parametric, and bootstrap tests of hypothesis, 3rd edn. Springer, New York, 2005, p. 14). For example, the paired t-test is a parametric test for paired data as it relies on the assumption that the observed data independently obey normal distributions (See Chap. 2 Sect. 2.2); the sign test is a nonparametric test; they may be applied to the same data if the normality assumption is not valid. This book only discusses parametric tests for comparing means, namely, t-tests and ANOVAs. See Chap. 2 for a discussion on the robustness of the t-test to the normality assumption violation.) As this book is intended for IR researchers such as myself, not statisticians, well-known theorems are presented without proofs; only brief proofs for corollaries are given. In the next two chapters, we shall use these basic theorems and corollaries as black boxes just as programmers utilise standard libraries when writing their own code. This chapter also defines less well-known distributions called noncentral distributions (Sect. 1.3), which we shall need for discussing sample size design and power analysis in Chaps. 6 and 7. Hence Sect. 1.3 may be skipped if the reader only wishes to learn about the principles and limitations of significance testing; however, such readers should read up to Chap. 5 before abandoning this book.
Notes
- 1.
Henceforth, for simplicity, I will basically ignore the difference between “topics” (i.e., information need statements) and “queries” (i.e., the sets of keywords input to the search engine) and that between “queries” and the “scores” computed for their search results.
- 2.
- 3.
Is a TREC (Text REtrieval Conference) topic set a random sample? Probably not. However, the reader should be aware that IR researchers who rely on significance tests such as t-tests and ANOVA for comparing system means implicitly rely on the basic assumption that a topic set is a random sample. The exact assumptions for t-tests and ANOVA are stated in Chaps. 2 and 3. Also, a computer-based significance test that relies on neither random sampling nor any distributional assumptions will be described in Chap. 4 Sect. 4.5.
- 4.
In this book, a random variable and its realisations are denoted by the same symbol, e.g. x.
- 5.
- 6.
- 7.
A one-sided test of the form \(H_{1}: \mu_{X} > \mu_{Y}\) would make sense if either \(\mu_{X} < \mu_{Y}\) is simply impossible, or if you do not need to consider the case where \(\mu_{X} < \mu_{Y}\) even if this is possible [12]. For example, if you are measuring the effect of introducing an aggressive stemming algorithm into your IR system in terms of recall (not precision) and you know that this can never hurt recall, a one-sided test may be appropriate. But in practice, when you propose a new IR algorithm and want to compare it with a competitive baseline, it is rarely the case that you know in advance that your proposal is better. Hence I recommend the two-sided test as the default. Whichever you choose, the hypotheses \(H_{0}\) and \(H_{1}\) must be set up before actually examining the data.
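The relationship between the two kinds of tests can be made concrete with a small sketch: given a test statistic that obeys the standard normal distribution under \(H_{0}\), the two-sided p-value is simply twice the one-sided one (for a positive statistic). The function name `p_values` and the example statistic are my own, for illustration only.

```python
from statistics import NormalDist

def p_values(z: float) -> tuple[float, float]:
    """Return (one_sided, two_sided) p-values for a z statistic,
    where the one-sided alternative is H1: mu_X > mu_Y."""
    phi = NormalDist()  # standard normal distribution
    one_sided = 1.0 - phi.cdf(z)                # P(Z >= z)
    two_sided = 2.0 * (1.0 - phi.cdf(abs(z)))   # P(|Z| >= |z|)
    return one_sided, two_sided

# For z = 1.96, the two-sided p-value is about 0.05,
# exactly twice the one-sided p-value.
one, two = p_values(1.96)
```

Note that choosing the one-sided alternative after seeing that the statistic happens to be positive would halve the p-value after the fact, which is precisely why the hypotheses must be fixed in advance.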
- 8.
“Dichotomous thinking,” one of the major reasons why classical significance testing has been severely criticised for decades, will be discussed in Chap. 5.
- 9.
For a discussion on the difference between convergence in probability (which is used in the weak law of large numbers) and almost sure convergence (which is used in the strong law of large numbers), see, for example, https://stats.stackexchange.com/questions/2230/convergence-in-probability-vs-almost-sure-convergence.
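The weak law of large numbers itself is easy to observe empirically: the sample mean of independent draws drifts towards the population mean as the sample size grows. A minimal sketch, assuming Bernoulli draws with population mean p = 0.3 (the variable names are my own):

```python
import random

random.seed(0)
p = 0.3  # population mean of a Bernoulli(p) variable
draws = [1 if random.random() < p else 0 for _ in range(100_000)]

# sample means at increasing sample sizes n;
# the deviation |mean - p| tends to shrink as n grows
means = {n: sum(draws[:n]) / n for n in (10, 1_000, 100_000)}
```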
- 10.
With Microsoft Excel, \(z_{\mathit{inv}}(P)\) can be obtained as NORM.S.INV(1 − P); with R, it can be obtained as qnorm(P, lower.tail=FALSE).
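The same upper P-point can be obtained in standard-library Python (3.8+) via `statistics.NormalDist`; the wrapper name `z_inv` below is my own, chosen to match the notation in the text.

```python
from statistics import NormalDist

def z_inv(P: float) -> float:
    """Upper P-point of the standard normal: P(Z >= z_inv(P)) = P.
    Mirrors Excel's NORM.S.INV(1 - P) and R's qnorm(P, lower.tail=FALSE)."""
    return NormalDist().inv_cdf(1.0 - P)

z = z_inv(0.025)  # roughly 1.96
```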
- 11.
It is often recommended that a binomial distribution be approximated with a normal distribution provided nP ≥ 5 and n(1 − P) ≥ 5 hold [12]. Note that when n = 15 and P = 0.5, we have nP = n(1 − P) = 7.5 > 5, whereas, when n = 15 and P = 0.2, we have nP = 3 < 5. In the latter situation, the above recommendation suggests that we should increase the sample size to n = 25; however, note that even this is not a large number.
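The rule of thumb and the quality of the approximation can both be checked directly; the sketch below (function names are my own) compares an exact binomial upper tail with its normal approximation, using a continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_upper_tail(n: int, P: float, k: int) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, P)."""
    return sum(comb(n, i) * P**i * (1 - P)**(n - i) for i in range(k, n + 1))

def normal_approx_upper_tail(n: int, P: float, k: int) -> float:
    """Normal approximation with a continuity correction of 0.5."""
    mu, sd = n * P, sqrt(n * P * (1 - P))
    return 1.0 - NormalDist(mu, sd).cdf(k - 0.5)

def rule_ok(n: int, P: float) -> bool:
    """The nP >= 5 and n(1 - P) >= 5 rule of thumb."""
    return n * P >= 5 and n * (1 - P) >= 5

# n=15, P=0.5 satisfies the rule; n=15, P=0.2 does not.
exact = binom_upper_tail(15, 0.5, 11)
approx = normal_approx_upper_tail(15, 0.5, 11)
```

When the rule holds, as in the n = 15, P = 0.5 case above, the two tail probabilities agree to within a few thousandths.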
- 12.
The sample mean uses n as the denominator because it is based on n independent pieces of information, namely, the x i. In contrast, while Eq. 1.13 shows that S is based on the \((x_{i}-\bar {x})\)’s, these are not actually n independent pieces of information, since \(\sum _{i=1}^{n}(x_{i}-\bar {x}) = \sum _{i=1}^{n}x_{i} - n\bar {x} = 0\) holds. There are only (n − 1) independent pieces of information in the sense that, once we have decided on (n − 1) values out of n, the last one is automatically determined by the above constraint. For this reason, dividing S by n − 1 makes sense [12]. See also Sect. 1.2.4, where the degrees of freedom of S are discussed.
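Both facts are easy to verify numerically: the deviations from the sample mean always sum to zero, and Python's standard library exposes both denominators (`statistics.variance` divides by n − 1, `statistics.pvariance` by n). The data values below are my own, for illustration.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
xbar = statistics.mean(x)

# the deviations always sum to zero, so only n - 1 of them are free
dev_sum = sum(xi - xbar for xi in x)

S = sum((xi - xbar) ** 2 for xi in x)   # sum of squared deviations
v_unbiased = S / (len(x) - 1)           # divides by n - 1 (sample variance V)
v_population = S / len(x)               # divides by n (biased)
```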
- 13.
To be more specific, \(E(S/n)=\frac {n-1}{n}\sigma ^2 < \sigma ^2\) and hence S/n underestimates \(\sigma^2\) [12]. Also of note is that the sample standard deviation \(\sqrt {V}\) is not an unbiased estimator of the population standard deviation σ despite the fact that \(E(V) = \sigma^2\).
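The bias is visible in a short simulation: averaging S/n and S/(n − 1) over many small samples drawn from a distribution with known \(\sigma^2 = 1\), the former settles near \(\frac{n-1}{n}\sigma^2\) while the latter settles near \(\sigma^2\). A minimal sketch (sample size and trial count are my own choices):

```python
import random

random.seed(42)
n, trials = 5, 20_000   # small samples from N(0, 1), so sigma^2 = 1

biased_sum, unbiased_sum = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    S = sum((x - xbar) ** 2 for x in xs)
    biased_sum += S / n          # E(S/n) = (n-1)/n * sigma^2 = 0.8 here
    unbiased_sum += S / (n - 1)  # E(S/(n-1)) = sigma^2 = 1
biased = biased_sum / trials
unbiased = unbiased_sum / trials
```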
- 14.
It is known that the population mean and the population variance of \(\chi^2\) are given by \(E(\chi^{2}) = \phi\) and \(V(\chi^{2}) = 2\phi\), respectively.
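These moments follow from the definition of \(\chi^2\) with ϕ degrees of freedom as a sum of ϕ squared independent standard normal variables, and a simulation built directly on that definition recovers them. A sketch under my own choice of ϕ and trial count:

```python
import random
import statistics

random.seed(1)
phi, trials = 4, 20_000

# chi-square with phi degrees of freedom: sum of phi squared N(0, 1) draws
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(phi))
           for _ in range(trials)]
m = statistics.mean(samples)       # theory: E(chi2) = phi
v = statistics.variance(samples)   # theory: V(chi2) = 2 * phi
```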
- 15.
With Microsoft Excel, \(\chi _{\mathit {inv}}^{2}(\phi ; P)\) can be obtained as CHISQ.INV.RT(P, ϕ); with R, it can be obtained as qchisq(P, ϕ, lower.tail=FALSE).
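Standard-library Python has no chi-square quantile function, but \(\chi _{\mathit {inv}}^{2}(\phi ; P)\) can be recovered by inverting the CDF numerically. The sketch below (all function names are my own) evaluates the CDF via a series expansion of the regularized lower incomplete gamma function and then bisects; for ϕ = 1 and P = 0.05 it reproduces the familiar 3.841 = 1.96².

```python
from math import exp, gamma

def reg_lower_gamma(a: float, x: float) -> float:
    """Regularized lower incomplete gamma P(a, x), via the standard
    series expansion (adequate for the moderate arguments used here)."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / gamma(a + 1.0)
    total = term
    n = 1
    while n < 10_000:
        term *= x / (a + n)
        total += term
        if term < 1e-16 * total:
            break
        n += 1
    return total * exp(-x) * x ** a

def chi2_cdf(x: float, phi: int) -> float:
    """CDF of the chi-square distribution with phi degrees of freedom."""
    return reg_lower_gamma(phi / 2.0, x / 2.0)

def chi2_inv(phi: int, P: float) -> float:
    """Upper P-point: P(chi2 >= chi2_inv(phi, P)) = P, found by bisection.
    Mirrors Excel's CHISQ.INV.RT(P, phi) and R's qchisq(P, phi, lower.tail=FALSE)."""
    lo, hi = 0.0, 10.0 * phi + 50.0   # ample range for practical P values
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, phi) < 1.0 - P:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

The same invert-the-CDF-by-bisection idea works for the t and F quantiles of the later notes, given their CDFs.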
- 16.
It is known that the population mean and the population variance of t are given by \(E(t) = 0\) (for \(\phi \geq 2\)) and \(V(t)=\frac {\phi }{\phi -2}\) (for \(\phi \geq 3\)), respectively.
- 17.
With Microsoft Excel, \(t_{\mathit {inv}}(\phi ; P)\) can be obtained as T.INV.2T(P, ϕ); with R, it can be obtained as qt(P/2, ϕ, lower.tail=FALSE).
- 18.
It is known that the population mean and the population variance of F are given by \(E(F)=\frac {\phi _{2}}{\phi _{2}-2}\) (for \(\phi _{2} \geq 3\)) and \(V(F)=\frac {2 \phi _{2}^2(\phi _{1}+\phi _{2}-2)}{\phi _{1}(\phi _{2}-2)^2(\phi _{2}-4)}\) (for \(\phi _{2} \geq 5\)), respectively.
- 19.
With Microsoft Excel, \(F_{\mathit {inv}}(\phi _{1}, \phi _{2}; P)\) can be obtained as F.INV.RT(P, \(\phi _{1}\), \(\phi _{2}\)); with R, it can be obtained as qf(P, \(\phi _{1}\), \(\phi _{2}\), lower.tail=FALSE).
- 20.
It is known that the population mean and the population variance of \(t^{\prime }\) are given by \(E(t^{\prime })= \frac {\lambda \sqrt {\phi /2}\, {\Gamma }((\phi -1)/2)}{{\Gamma }(\phi /2)}\) (for \(\phi \geq 2\)) and \(V(t^{\prime })= \frac {\phi (1+\lambda ^2)}{\phi -2} - \{E(t^{\prime })\}^2\) (for \(\phi \geq 3\)), respectively.
- 21.
It is known that the population mean and the population variance of \(\chi^{\prime 2}\) are given by \(E(\chi^{\prime 2}) = \phi + \lambda\) and \(V(\chi^{\prime 2}) = 2(\phi + 2\lambda)\), respectively.
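These moments can be checked by simulating \(\chi^{\prime 2}\) from its definition: a sum of ϕ squared normal variables whose means \(\mu_i\) need not be zero, with noncentrality parameter \(\lambda = \sum_i \mu_i^2\). A sketch under my own choice of ϕ and the \(\mu_i\):

```python
import random
import statistics

random.seed(7)
phi, trials = 3, 20_000
mus = [1.0, 0.5, 0.0]          # per-component means; len(mus) == phi
lam = sum(m * m for m in mus)  # noncentrality parameter lambda = 1.25

# noncentral chi-square: sum of squared N(mu_i, 1) draws
samples = [sum((random.gauss(0.0, 1.0) + m) ** 2 for m in mus)
           for _ in range(trials)]
m_hat = statistics.mean(samples)       # theory: phi + lambda = 4.25
v_hat = statistics.variance(samples)   # theory: 2 * (phi + 2 * lambda) = 11.0
```

Setting all the \(\mu_i\) to zero (so λ = 0) collapses this to the central \(\chi^2\) of Sect. 1.2, with E = ϕ and V = 2ϕ.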
- 22.
It is known that the population mean and the population variance of \(F^{\prime }\) are given by \(E(F^{\prime }) = \frac {\phi _{2}(\phi _{1} + \lambda )}{\phi _{1}(\phi _{2}-2)}\) (for \(\phi _{2} \geq 3\)) and \(V(F^{\prime })=2\left (\frac {\phi _{2}}{\phi _{1}}\right )^2 \frac {(\phi _{1}+\lambda )^2 + (\phi _{1}+2\lambda )(\phi _{2}-2)}{(\phi _{2}-2)^2(\phi _{2}-4)} \) (for \(\phi _{2} \geq 5\)), respectively.
References
C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, Chap. 3, ed. by E.M. Voorhees, D.K. Harman (The MIT Press, Cambridge, 2005), pp. 53–75
B. Carterette, Bayesian inference for information retrieval evaluation, in Proceedings of ACM ICTIR, Northampton, 2015, pp. 31–40
J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd edn. (Psychology Press, New York, 1988)
G.V. Cormack, C.R. Palmer, C.L.A. Clarke, Efficient construction of large test collections, in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 282–289
G. Cumming, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (Routledge, New York/London, 2012)
P.D. Ellis, The Essential Guide to Effect Sizes (Cambridge University Press, Cambridge/New York, 2010)
P. Good, Permutation, Parametric, and Bootstrap Tests of Hypothesis, 3rd edn. (Springer, New York, 2005)
R.J. Grissom, J.J. Kim, Effect Sizes for Research, 2nd edn. (Routledge, New York, 2012)
K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)
J.K. Kruschke, Doing Bayesian Data Analysis, 2nd edn. (Elsevier, Amsterdam, 2015)
K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis, 4th edn. (Routledge, New York, 2014)
Y. Nagata, How to Understand Statistical Methods (in Japanese) (JUSE Press, Shibuya, 1996)
Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)
S.E. Robertson, On document populations and measures of IR effectiveness, in Proceedings of ICTIR, Budapest, 2007, pp. 9–22
S.E. Robertson, E. Kanoulas, On per-topic variance in IR evaluation, in Proceedings of ACM SIGIR, Portland, 2012, pp. 891–900
T. Sakai, Topic set size design. Inf. Retr. 19(3), 256–283 (2016)
T. Sakai, The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation, in Proceedings of ACM SIGIR, Shinjuku, 2017, pp. 25–34
H. Toyoda, (ed.), Fundamentals of Bayesian Statistics: Practical Getting Started by Hamiltonian Monte Carlo Method (in Japanese) (Asakura Shoten, Shinjuku, 2015)
H. Toyoda, An Introduction to Statistical Data Analysis: Bayesian Statistics for the ‘post p-value era’ (in Japanese) (Asakura Shoten, Shinjuku, 2016)
© 2018 Springer Nature Singapore Pte Ltd.
Sakai, T. (2018). Preliminaries. In: Laboratory Experiments in Information Retrieval. The Information Retrieval Series, vol 40. Springer, Singapore. https://doi.org/10.1007/978-981-13-1199-4_1