Abstract
This chapter discusses the basic principles of classical statistical significance testing (Sect. 1.1) and defines some well-known probability distributions that are necessary for discussing parametric significance tests (Sect. 1.2). (“A problem is parametric if the form of the underlying distribution is known, and it is nonparametric if we have no knowledge concerning the distribution(s) from which the observations are drawn.” Good (Permutation, parametric, and bootstrap tests of hypothesis, 3rd edn. Springer, New York, 2005, p. 14). For example, the paired t-test is a parametric test for paired data as it relies on the assumption that the observed data independently obey normal distributions (See Chap. 2 Sect. 2.2); the sign test is a nonparametric test; they may be applied to the same data if the normality assumption is not valid. This book only discusses parametric tests for comparing means, namely, t-tests and ANOVAs. See Chap. 2 for a discussion on the robustness of the t-test to the normality assumption violation.) As this book is intended for IR researchers such as myself, not statisticians, well-known theorems are presented without proofs; only brief proofs for corollaries are given. In the next two chapters, we shall use these basic theorems and corollaries as black boxes just as programmers utilise standard libraries when writing their own code. This chapter also defines less well-known distributions called noncentral distributions (Sect. 1.3), which we shall need for discussing sample size design and power analysis in Chaps. 6 and 7. Hence Sect. 1.3 may be skipped if the reader only wishes to learn about the principles and limitations of significance testing; however, such readers should read up to Chap. 5 before abandoning this book.
Notes
- 1.
Henceforth, for simplicity, I will basically ignore the difference between “topics” (i.e., information need statements) and “queries” (i.e., the sets of keywords input to the search engine) and that between “queries” and the “scores” computed for their search results.
- 2.
- 3.
Is a TREC (Text REtrieval Conference) topic set a random sample? Probably not. However, the reader should be aware that IR researchers who rely on significance tests such as t-tests and ANOVA for comparing system means implicitly rely on the basic assumption that a topic set is a random sample. The exact assumptions for t-tests and ANOVA are stated in Chaps. 2 and 3. Also, a computer-based significance test that relies on neither random sampling nor any distributional assumptions will be described in Chap. 4 Sect. 4.5.
- 4.
In this book, a random variable and its realisations are denoted by the same symbol, e.g. x.
- 5.
- 6.
- 7.
A one-sided test of the form \(H_{1}: \mu_{X} > \mu_{Y}\) would make sense if either \(\mu_{X} < \mu_{Y}\) is simply impossible, or if you do not need to consider the case where \(\mu_{X} < \mu_{Y}\) even if this is possible [12]. For example, if you are measuring the effect of introducing an aggressive stemming algorithm into your IR system in terms of recall (not precision) and you know that this can never hurt recall, a one-sided test may be appropriate. But in practice, when you propose a new IR algorithm and want to compare it with a competitive baseline, it is rarely the case that you know in advance that your proposal is better. Hence I recommend the two-sided test as the default. Whichever you choose, the hypotheses \(H_{0}\) and \(H_{1}\) must be set up before actually examining the data.
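The relationship between the two kinds of tests can be made concrete with a small sketch: given a test statistic that obeys the standard normal distribution under \(H_{0}\), the two-sided p-value is simply twice the one-sided one (for a positive statistic). The function name `p_values` and the example statistic are my own, for illustration only.

```python
from statistics import NormalDist

def p_values(z: float) -> tuple[float, float]:
    """Return (one_sided, two_sided) p-values for a z statistic,
    where the one-sided alternative is H1: mu_X > mu_Y."""
    phi = NormalDist()  # standard normal distribution
    one_sided = 1.0 - phi.cdf(z)                # P(Z >= z)
    two_sided = 2.0 * (1.0 - phi.cdf(abs(z)))   # P(|Z| >= |z|)
    return one_sided, two_sided

# For z = 1.96, the two-sided p-value is about 0.05,
# exactly twice the one-sided p-value.
one, two = p_values(1.96)
```

Note that choosing the one-sided alternative after seeing that the statistic happens to be positive would halve the p-value after the fact, which is precisely why the hypotheses must be fixed in advance.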
- 8.
“Dichotomous thinking,” one of the major reasons why classical significance testing has been severely criticised for decades, will be discussed in Chap. 5.
- 9.
For a discussion on the difference between convergence in probability (which is used in the weak law of large numbers) and almost sure convergence (which is used in the strong law of large numbers), see, for example, https://stats.stackexchange.com/questions/2230/convergence-in-probability-vs-almost-sure-convergence.
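The weak law of large numbers itself is easy to observe empirically: the sample mean of independent draws drifts towards the population mean as the sample size grows. A minimal sketch, assuming Bernoulli draws with population mean p = 0.3 (the variable names are my own):

```python
import random

random.seed(0)
p = 0.3  # population mean of a Bernoulli(p) variable
draws = [1 if random.random() < p else 0 for _ in range(100_000)]

# sample means at increasing sample sizes n;
# the deviation |mean - p| tends to shrink as n grows
means = {n: sum(draws[:n]) / n for n in (10, 1_000, 100_000)}
```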
- 10.
With Microsoft Excel, \(z_{\mathit{inv}}(P)\) can be obtained as NORM.S.INV(1 − P); with R, it can be obtained as qnorm(P, lower.tail=FALSE).
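The same upper P-point can be obtained in standard-library Python (3.8+) via `statistics.NormalDist`; the wrapper name `z_inv` below is my own, chosen to match the notation in the text.

```python
from statistics import NormalDist

def z_inv(P: float) -> float:
    """Upper P-point of the standard normal: P(Z >= z_inv(P)) = P.
    Mirrors Excel's NORM.S.INV(1 - P) and R's qnorm(P, lower.tail=FALSE)."""
    return NormalDist().inv_cdf(1.0 - P)

z = z_inv(0.025)  # roughly 1.96
```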
- 11.
It is often recommended that a binomial distribution be approximated with a normal distribution provided nP ≥ 5 and n(1 − P) ≥ 5 hold [12]. Note that when n = 15 and P = 0.5, we have nP = n(1 − P) = 7.5 > 5, whereas, when n = 15 and P = 0.2, we have nP = 3 < 5. In the latter situation, the above recommendation suggests that we should increase the sample size to n = 25; however, note that even this is not a large number.
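The rule of thumb and the quality of the approximation can both be checked directly; the sketch below (function names are my own) compares an exact binomial upper tail with its normal approximation, using a continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_upper_tail(n: int, P: float, k: int) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, P)."""
    return sum(comb(n, i) * P**i * (1 - P)**(n - i) for i in range(k, n + 1))

def normal_approx_upper_tail(n: int, P: float, k: int) -> float:
    """Normal approximation with a continuity correction of 0.5."""
    mu, sd = n * P, sqrt(n * P * (1 - P))
    return 1.0 - NormalDist(mu, sd).cdf(k - 0.5)

def rule_ok(n: int, P: float) -> bool:
    """The nP >= 5 and n(1 - P) >= 5 rule of thumb."""
    return n * P >= 5 and n * (1 - P) >= 5

# n=15, P=0.5 satisfies the rule; n=15, P=0.2 does not.
exact = binom_upper_tail(15, 0.5, 11)
approx = normal_approx_upper_tail(15, 0.5, 11)
```

When the rule holds, as in the n = 15, P = 0.5 case above, the two tail probabilities agree to within a few thousandths.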
- 12.
The sample mean uses n as the denominator because it is based on n independent pieces of information, namely, the x i. In contrast, while Eq. 1.13 shows that S is based on the \((x_{i}-\bar {x})\)’s, these are not actually n independent pieces of information, since \(\sum _{i=1}^{n}(x_{i}-\bar {x}) = \sum _{i=1}^{n}x_{i} - n\bar {x} = 0\) holds. There are only (n − 1) independent pieces of information in the sense that, once we have decided on (n − 1) values out of n, the last one is automatically determined by the above constraint. For this reason, dividing S by n − 1 makes sense [12]. See also Sect. 1.2.4, where the degrees of freedom of S are discussed.
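Both facts are easy to verify numerically: the deviations from the sample mean always sum to zero, and Python's standard library exposes both denominators (`statistics.variance` divides by n − 1, `statistics.pvariance` by n). The data values below are my own, for illustration.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
xbar = statistics.mean(x)

# the deviations always sum to zero, so only n - 1 of them are free
dev_sum = sum(xi - xbar for xi in x)

S = sum((xi - xbar) ** 2 for xi in x)   # sum of squared deviations
v_unbiased = S / (len(x) - 1)           # divides by n - 1 (sample variance V)
v_population = S / len(x)               # divides by n (biased)
```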
- 13.
To be more specific, \(E(S/n)=\frac {n-1}{n}\sigma ^2 < \sigma ^2\) and hence S/n underestimates \(\sigma^2\) [12]. Also of note is that the sample standard deviation \(\sqrt {V}\) is not an unbiased estimator of the population standard deviation σ despite the fact that \(E(V) = \sigma^2\).
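The bias is visible in a short simulation: averaging S/n and S/(n − 1) over many small samples drawn from a distribution with known \(\sigma^2 = 1\), the former settles near \(\frac{n-1}{n}\sigma^2\) while the latter settles near \(\sigma^2\). A minimal sketch (sample size and trial count are my own choices):

```python
import random

random.seed(42)
n, trials = 5, 20_000   # small samples from N(0, 1), so sigma^2 = 1

biased_sum, unbiased_sum = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    S = sum((x - xbar) ** 2 for x in xs)
    biased_sum += S / n          # E(S/n) = (n-1)/n * sigma^2 = 0.8 here
    unbiased_sum += S / (n - 1)  # E(S/(n-1)) = sigma^2 = 1
biased = biased_sum / trials
unbiased = unbiased_sum / trials
```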
- 14.
It is known that the population mean and the population variance of \(\chi^2\) are given by \(E(\chi^{2}) = \phi\) and \(V(\chi^{2}) = 2\phi\), respectively.
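These moments follow from the definition of \(\chi^2\) with ϕ degrees of freedom as a sum of ϕ squared independent standard normal variables, and a simulation built directly on that definition recovers them. A sketch under my own choice of ϕ and trial count:

```python
import random
import statistics

random.seed(1)
phi, trials = 4, 20_000

# chi-square with phi degrees of freedom: sum of phi squared N(0, 1) draws
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(phi))
           for _ in range(trials)]
m = statistics.mean(samples)       # theory: E(chi2) = phi
v = statistics.variance(samples)   # theory: V(chi2) = 2 * phi
```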
- 15.
With Microsoft Excel, \(\chi _{\mathit {inv}}^{2}(\phi ; P)\) can be obtained as CHISQ.INV.RT(P, ϕ); with R, it can be obtained as qchisq(P, ϕ, lower.tail=FALSE).
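Standard-library Python has no chi-square quantile function, but \(\chi _{\mathit {inv}}^{2}(\phi ; P)\) can be recovered by inverting the CDF numerically. The sketch below (all function names are my own) evaluates the CDF via a series expansion of the regularized lower incomplete gamma function and then bisects; for ϕ = 1 and P = 0.05 it reproduces the familiar 3.841 = 1.96².

```python
from math import exp, gamma

def reg_lower_gamma(a: float, x: float) -> float:
    """Regularized lower incomplete gamma P(a, x), via the standard
    series expansion (adequate for the moderate arguments used here)."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / gamma(a + 1.0)
    total = term
    n = 1
    while n < 10_000:
        term *= x / (a + n)
        total += term
        if term < 1e-16 * total:
            break
        n += 1
    return total * exp(-x) * x ** a

def chi2_cdf(x: float, phi: int) -> float:
    """CDF of the chi-square distribution with phi degrees of freedom."""
    return reg_lower_gamma(phi / 2.0, x / 2.0)

def chi2_inv(phi: int, P: float) -> float:
    """Upper P-point: P(chi2 >= chi2_inv(phi, P)) = P, found by bisection.
    Mirrors Excel's CHISQ.INV.RT(P, phi) and R's qchisq(P, phi, lower.tail=FALSE)."""
    lo, hi = 0.0, 10.0 * phi + 50.0   # ample range for practical P values
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, phi) < 1.0 - P:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

The same invert-the-CDF-by-bisection idea works for the t and F quantiles of the later notes, given their CDFs.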
- 16.
It is known that the population mean and the population variance of t are given by \(E(t) = 0\) (for \(\phi \geq 2\)) and \(V(t)=\frac {\phi }{\phi -2}\) (for \(\phi \geq 3\)), respectively.
- 17.
With Microsoft Excel, \(t_{\mathit {inv}}(\phi ; P)\) can be obtained as T.INV.2T(P, ϕ); with R, it can be obtained as qt(P/2, ϕ, lower.tail=FALSE).
- 18.
It is known that the population mean and the population variance of F are given by \(E(F)=\frac {\phi _{2}}{\phi _{2}-2}\) (for \(\phi _{2} \geq 3\)) and \(V(F)=\frac {2 \phi _{2}^2(\phi _{1}+\phi _{2}-2)}{\phi _{1}(\phi _{2}-2)^2(\phi _{2}-4)}\) (for \(\phi _{2} \geq 5\)), respectively.
- 19.
With Microsoft Excel, \(F_{\mathit {inv}}(\phi _{1}, \phi _{2}; P)\) can be obtained as F.INV.RT(P, \(\phi _{1}\), \(\phi _{2}\)); with R, it can be obtained as qf(P, \(\phi _{1}\), \(\phi _{2}\), lower.tail=FALSE).
- 20.
It is known that the population mean and the population variance of \(t^{\prime }\) are given by \(E(t^{\prime })= \frac {\lambda \sqrt {\phi /2}\, {\Gamma }((\phi -1)/2)}{{\Gamma }(\phi /2)}\) (for \(\phi \geq 2\)) and \(V(t^{\prime })= \frac {\phi (1+\lambda ^2)}{\phi -2} - \{E(t^{\prime })\}^2\) (for \(\phi \geq 3\)), respectively.
- 21.
It is known that the population mean and the population variance of \(\chi^{\prime 2}\) are given by \(E(\chi^{\prime 2}) = \phi + \lambda\) and \(V(\chi^{\prime 2}) = 2(\phi + 2\lambda)\), respectively.
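These moments can be checked by simulating \(\chi^{\prime 2}\) from its definition: a sum of ϕ squared normal variables whose means \(\mu_i\) need not be zero, with noncentrality parameter \(\lambda = \sum_i \mu_i^2\). A sketch under my own choice of ϕ and the \(\mu_i\):

```python
import random
import statistics

random.seed(7)
phi, trials = 3, 20_000
mus = [1.0, 0.5, 0.0]          # per-component means; len(mus) == phi
lam = sum(m * m for m in mus)  # noncentrality parameter lambda = 1.25

# noncentral chi-square: sum of squared N(mu_i, 1) draws
samples = [sum((random.gauss(0.0, 1.0) + m) ** 2 for m in mus)
           for _ in range(trials)]
m_hat = statistics.mean(samples)       # theory: phi + lambda = 4.25
v_hat = statistics.variance(samples)   # theory: 2 * (phi + 2 * lambda) = 11.0
```

Setting all the \(\mu_i\) to zero (so λ = 0) collapses this to the central \(\chi^2\) of Sect. 1.2, with E = ϕ and V = 2ϕ.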
- 22.
It is known that the population mean and the population variance of \(F^{\prime }\) are given by \(E(F^{\prime }) = \frac {\phi _{2}(\phi _{1} + \lambda )}{\phi _{1}(\phi _{2}-2)}\) (for \(\phi _{2} \geq 3\)) and \(V(F^{\prime })=2\left (\frac {\phi _{2}}{\phi _{1}}\right )^2 \frac {(\phi _{1}+\lambda )^2 + (\phi _{1}+2\lambda )(\phi _{2}-2)}{(\phi _{2}-2)^2(\phi _{2}-4)} \) (for \(\phi _{2} \geq 5\)), respectively.
References
C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, Chap. 3, ed. by E.M. Voorhees, D.K. Harman (The MIT Press, Cambridge, 2005), pp. 53–75
B. Carterette, Bayesian inference for information retrieval evaluation, in Proceedings of ACM ICTIR, Northampton, 2015, pp. 31–40
J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd edn. (Psychology Press, New York, 1988)
G.V. Cormack, C.R. Palmer, C.L.A. Clarke, Efficient construction of large test collections, in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 282–289
G. Cumming, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (Routledge, New York/London, 2012)
P.D. Ellis, The Essential Guide to Effect Sizes (Cambridge University Press, Cambridge/New York, 2010)
P. Good, Permutation, Parametric, and Bootstrap Tests of Hypothesis, 3rd edn. (Springer, New York, 2005)
R.J. Grissom, J.J. Kim, Effect Sizes for Research, 2nd edn. (Routledge, New York, 2012)
K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)
J.K. Kruschke, Doing Bayesian Data Analysis, 2nd edn. (Elsevier, Amsterdam, 2015)
K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis, 4th edn. (Routledge, New York, 2014)
Y. Nagata, How to Understand Statistical Methods (in Japanese) (JUSE Press, Shibuya, 1996)
Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)
S.E. Robertson, On document populations and measures of IR effectiveness, in Proceedings of ICTIR, Budapest, 2007, pp. 9–22
S.E. Robertson, E. Kanoulas, On per-topic variance in IR evaluation, in Proceedings of ACM SIGIR, Portland, 2012, pp. 891–900
T. Sakai, Topic set size design. Inf. Retr. 19(3), 256–283 (2016)
T. Sakai, The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation, in Proceedings of ACM SIGIR, Shinjuku, 2017, pp. 25–34
H. Toyoda, (ed.), Fundamentals of Bayesian Statistics: Practical Getting Started by Hamiltonian Monte Carlo Method (in Japanese) (Asakura Shoten, Shinjuku, 2015)
H. Toyoda, An Introduction to Statistical Data Analysis: Bayesian Statistics for the ‘post p-value era’ (in Japanese) (Asakura Shoten, Shinjuku, 2016)
© 2018 Springer Nature Singapore Pte Ltd.
Sakai, T. (2018). Preliminaries. In: Laboratory Experiments in Information Retrieval. The Information Retrieval Series, vol 40. Springer, Singapore. https://doi.org/10.1007/978-981-13-1199-4_1