
Part of the book series: The Information Retrieval Series (INRE, volume 40)

Abstract

This chapter discusses the basic principles of classical statistical significance testing (Sect. 1.1) and defines some well-known probability distributions that are necessary for discussing parametric significance tests (Sect. 1.2). (“A problem is parametric if the form of the underlying distribution is known, and it is nonparametric if we have no knowledge concerning the distribution(s) from which the observations are drawn.” Good (Permutation, parametric, and bootstrap tests of hypothesis, 3rd edn. Springer, New York, 2005, p. 14). For example, the paired t-test is a parametric test for paired data as it relies on the assumption that the observed data independently obey normal distributions (see Chap. 2 Sect. 2.2); the sign test is a nonparametric test that may be applied to the same data even if the normality assumption is not valid. This book only discusses parametric tests for comparing means, namely, t-tests and ANOVAs. See Chap. 2 for a discussion on the robustness of the t-test to violations of the normality assumption.) As this book is intended for IR researchers such as myself, not statisticians, well-known theorems are presented without proofs; only brief proofs for corollaries are given. In the next two chapters, we shall use these basic theorems and corollaries as black boxes, just as programmers utilise standard libraries when writing their own code. This chapter also defines less well-known distributions called noncentral distributions (Sect. 1.3), which we shall need for discussing sample size design and power analysis in Chaps. 6 and 7. Hence Sect. 1.3 may be skipped if the reader only wishes to learn about the principles and limitations of significance testing; however, such readers should read up to Chap. 5 before abandoning this book.

Notes

  1.

    Henceforth, for simplicity, I will basically ignore the difference between “topics” (i.e., information need statements) and “queries” (i.e., the sets of keywords input to the search engine) and that between “queries” and the “scores” computed for their search results.

  2.

    While this book primarily discusses sample means over n topics, some IR researchers have explored the approach of regarding the document collection used in an experiment as a sample from a large population of documents [4, 14, 15].

  3.

    Is a TREC (Text REtrieval Conference) topic set a random sample? Probably not. However, the reader should be aware that IR researchers who rely on significance tests such as t-tests and ANOVA for comparing system means implicitly rely on the basic assumption that a topic set is a random sample. The exact assumptions for t-tests and ANOVA are stated in Chaps. 2 and 3. Also, a computer-based significance test that relies neither on random sampling nor on any distributional assumptions will be described in Chap. 4 Sect. 4.5.

  4.

    In this book, a random variable and its realisations are denoted by the same symbol, e.g. x.

  5.

    In contrast to classical significance testing, Bayesian statistics [2, 10, 17,18,19] treat population parameters as random variables. See also Chap. 8.

  6.

    Not to be confused with two-sample tests, in which you have one sample for System X and a different sample for System Y, possibly with different sample sizes (see Chap. 2 Sect. 2.3).

  7.

    A one-sided test of the form H1: μX > μY would make sense if either μX < μY is simply impossible, or if you do not need to consider the case where μX < μY even if it is possible [12]. For example, if you are measuring the effect of introducing an aggressive stemming algorithm into your IR system in terms of recall (not precision) and you know that this can never hurt recall, a one-sided test may be appropriate. But in practice, when you propose a new IR algorithm and want to compare it with a competitive baseline, it is rarely the case that you know in advance that your proposal is better. Hence I recommend the two-sided test as the default. Whichever you choose, hypotheses H0 and H1 must be set up before actually examining the data.

  8.

    “Dichotomous thinking,” one of the major reasons why classical significance testing has been severely criticised for decades, will be discussed in Chap. 5.

  9.

    For a discussion on the difference between convergence in probability (which is used in the weak law of large numbers) and almost sure convergence (which is used in the strong law of large numbers), see, for example, https://stats.stackexchange.com/questions/2230/convergence-in-probability-vs-almost-sure-convergence.
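
As a toy illustration of the law of large numbers discussed above (my own sketch, not from the book), the following simulates fair coin flips and watches the sample mean approach the population mean 0.5 as the sample size grows:

```python
import random

# Illustrative sketch: the sample mean of i.i.d. Bernoulli(0.5) draws
# approaches the population mean 0.5 as n grows, as the (weak) law of
# large numbers predicts.
random.seed(0)

def sample_mean(n):
    """Mean of n fair coin flips (1 = heads, 0 = tails)."""
    return sum(random.randint(0, 1) for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, sample_mean(n))
```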

  10.

    With Microsoft Excel, z_inv(P) can be obtained as NORM.S.INV(1 − P); with R, it can be obtained as qnorm(P, lower.tail=FALSE).
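
For readers without Excel or R, the same upper-tail inverse can be sketched with Python's standard library (my own illustration; the helper name z_inv simply mirrors the notation above):

```python
from statistics import NormalDist

# Sketch of the upper-tail inverse z_inv(P) of the standard normal
# distribution, i.e. the z such that Pr(Z >= z) = P. The standard
# library provides the lower-tail inverse inv_cdf, so
# z_inv(P) = inv_cdf(1 - P), mirroring Excel's NORM.S.INV(1 - P).
def z_inv(P):
    return NormalDist().inv_cdf(1.0 - P)

print(z_inv(0.025))  # roughly 1.96, the familiar two-sided 5% critical value
```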

  11.

    It is often recommended that a binomial distribution be approximated with a normal distribution provided nP ≥ 5 and n(1 − P) ≥ 5 hold [12]. Note that when n = 15 and P = 0.5, we have nP = n(1 − P) = 7.5 > 5, whereas, when n = 15 and P = 0.2, we have nP = 3 < 5. In the latter situation, the above recommendation suggests that we should increase the sample size to n = 25; however, note that even this is not a large number.
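
The rule of thumb above is easy to encode; this small sketch (mine, not the book's) checks the three cases mentioned:

```python
# Rule of thumb for approximating a binomial with a normal distribution:
# both nP >= 5 and n(1 - P) >= 5 should hold.
def normal_approx_ok(n, P):
    return n * P >= 5 and n * (1 - P) >= 5

print(normal_approx_ok(15, 0.5))  # nP = n(1 - P) = 7.5: rule satisfied
print(normal_approx_ok(15, 0.2))  # nP = 3: rule not satisfied
print(normal_approx_ok(25, 0.2))  # nP = 5: rule just satisfied
```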

  12.

    The sample mean uses n as the denominator because it is based on n independent pieces of information, namely, the x_i. In contrast, while Eq. 1.13 shows that S is based on the \((x_{i}-\bar{x})\)'s, these are not actually n independent pieces of information, since \(\sum_{i=1}^{n}(x_{i}-\bar{x}) = \sum_{i=1}^{n}x_{i} - n\bar{x} = 0\) holds. There are only (n − 1) independent pieces of information in the sense that, once we have decided on (n − 1) values out of n, the last one is automatically determined by the above constraint. For this reason, dividing S by n − 1 makes sense [12]. See also Sect. 1.2.4, where the degrees of freedom of S are discussed.
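
The constraint described above can be verified numerically; in this sketch (my own, with made-up data), the deviations sum to zero and the two denominators give different variance estimates:

```python
from statistics import mean, variance, pvariance

# The deviations from the sample mean always sum to zero, which is why
# only n - 1 of them are independent, and why `variance` (denominator
# n - 1) differs from `pvariance` (denominator n).
x = [4.0, 7.0, 1.0, 8.0]
xbar = mean(x)
deviations = [xi - xbar for xi in x]

print(sum(deviations))            # 0 up to rounding
print(variance(x), pvariance(x))  # S / (n - 1) vs S / n
```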

  13.

    To be more specific, \(E(S/n)=\frac{n-1}{n}\sigma^2 < \sigma^2\), and hence S/n underestimates σ² [12]. Also of note is that the sample standard deviation \(\sqrt{V}\) is not an unbiased estimator of the population standard deviation σ, despite the fact that E(V) = σ².
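
The bias of S/n can also be seen by simulation; this sketch (my own, assuming a standard normal population so that σ² = 1) averages S/n over many samples:

```python
import random

# Averaging S/n over many samples from N(0, 1) lands near
# ((n - 1)/n) * sigma^2 rather than sigma^2, so S/n systematically
# underestimates the population variance.
random.seed(42)
n, reps = 5, 20000
biased_estimates = []
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(x) / n
    S = sum((xi - xbar) ** 2 for xi in x)
    biased_estimates.append(S / n)

mean_biased = sum(biased_estimates) / reps
print(mean_biased)  # close to (n - 1)/n = 0.8, well below sigma^2 = 1
```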

  14.

    It is known that the population mean and the population variance of χ² are given by E(χ²) = ϕ and V(χ²) = 2ϕ, respectively.
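
These moments can be checked by simulation, since a χ² variable with ϕ degrees of freedom is a sum of ϕ squared standard normal variables (my own sketch, not from the book):

```python
import random

# A chi-square variable with phi degrees of freedom is the sum of phi
# squared standard normal variables, so its sample mean should sit near
# phi and its sample variance near 2 * phi.
random.seed(7)
phi, reps = 4, 50000
draws = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(phi))
         for _ in range(reps)]

m = sum(draws) / reps
v = sum((d - m) ** 2 for d in draws) / (reps - 1)
print(m, v)  # near phi = 4 and 2 * phi = 8
```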

  15.

    With Microsoft Excel, \(\chi_{\mathit{inv}}^{2}(\phi; P)\) can be obtained as CHISQ.INV.RT(P, ϕ); with R, it can be obtained as qchisq(P, ϕ, lower.tail=FALSE).

  16.

    It is known that the population mean and the population variance of t are given by E(t) = 0 (for ϕ ≥ 2) and \(V(t)=\frac{\phi}{\phi-2}\) (for ϕ ≥ 3), respectively.

  17.

    With Microsoft Excel, t_inv(ϕ; P) can be obtained as T.INV.2T(P, ϕ); with R, it can be obtained as qt(P/2, ϕ, lower.tail=FALSE).

  18.

    It is known that the population mean and the population variance of F are given by \(E(F)=\frac{\phi_{2}}{\phi_{2}-2}\) (for ϕ2 ≥ 3) and \(V(F)=\frac{2\phi_{2}^2(\phi_{1}+\phi_{2}-2)}{\phi_{1}(\phi_{2}-2)^2(\phi_{2}-4)}\) (for ϕ2 ≥ 5), respectively.

  19.

    With Microsoft Excel, F_inv(ϕ1, ϕ2; P) can be obtained as F.INV.RT(P, ϕ1, ϕ2); with R, it can be obtained as qf(P, ϕ1, ϕ2, lower.tail=FALSE).

  20.

    It is known that the population mean and the population variance of t′ are given by \(E(t^{\prime})=\frac{\lambda\sqrt{\phi/2}\,{\Gamma}((\phi-1)/2)}{{\Gamma}(\phi/2)}\) (for ϕ ≥ 2) and \(V(t^{\prime})=\frac{\phi(1+\lambda^2)}{\phi-2} - \{E(t^{\prime})\}^2\) (for ϕ ≥ 3), respectively.

  21.

    It is known that the population mean and the population variance of χ² are given by E(χ²) = ϕ + λ and V(χ²) = 2(ϕ + 2λ), respectively.
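
These moments can likewise be checked by simulation, using the characterisation of a noncentral χ² variable as a sum of squared normal variables with shifted means (my own sketch; here the entire noncentrality is placed on a single component):

```python
import random

# A noncentral chi-square variable with phi degrees of freedom and
# noncentrality lam can be generated as a sum of phi squared normals
# with means delta_i, where lam = sum(delta_i ** 2). Its sample mean
# should then sit near phi + lam.
random.seed(1)
phi, delta = 3, 2.0
lam = delta ** 2          # all noncentrality on one component
reps = 50000
draws = []
for _ in range(reps):
    total = random.gauss(delta, 1.0) ** 2                      # shifted component
    total += sum(random.gauss(0.0, 1.0) ** 2 for _ in range(phi - 1))
    draws.append(total)

m = sum(draws) / reps
print(m)  # near phi + lam = 7
```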

  22.

    It is known that the population mean and the population variance of F′ are given by \(E(F^{\prime})=\frac{\phi_{2}(\phi_{1}+\lambda)}{\phi_{1}(\phi_{2}-2)}\) (for ϕ2 ≥ 3) and \(V(F^{\prime})=2\left(\frac{\phi_{2}}{\phi_{1}}\right)^2\frac{(\phi_{1}+\lambda)^2 + (\phi_{1}+2\lambda)(\phi_{2}-2)}{(\phi_{2}-2)^2(\phi_{2}-4)}\) (for ϕ2 ≥ 5), respectively.

References

  1. C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, Chap. 3, ed. by E.M. Voorhees, D.K. Harman (The MIT Press, Cambridge, 2005), pp. 53–75

  2. B. Carterette, Bayesian inference for information retrieval evaluation, in Proceedings of ACM ICTIR, Northampton, 2015, pp. 31–40

  3. J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd edn. (Psychology Press, New York, 1988)

  4. G.V. Cormack, C.R. Palmer, C.L.A. Clarke, Efficient construction of large test collections, in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 282–289

  5. G. Cumming, Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (Routledge, New York/London, 2012)

  6. P.D. Ellis, The Essential Guide to Effect Sizes (Cambridge University Press, Cambridge/New York, 2010)

  7. P. Good, Permutation, Parametric, and Bootstrap Tests of Hypothesis, 3rd edn. (Springer, New York, 2005)

  8. R.J. Grissom, J.J. Kim, Effect Sizes for Research, 2nd edn. (Routledge, New York, 2012)

  9. K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)

  10. J.K. Kruschke, Doing Bayesian Data Analysis, 2nd edn. (Elsevier, Amsterdam, 2015)

  11. K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis, 4th edn. (Routledge, New York, 2014)

  12. Y. Nagata, How to Understand Statistical Methods (in Japanese) (JUSE Press, Shibuya, 1996)

  13. Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)

  14. S.E. Robertson, On document populations and measures of IR effectiveness, in Proceedings of ICTIR, Budapest, 2007, pp. 9–22

  15. S.E. Robertson, E. Kanoulas, On per-topic variance in IR evaluation, in Proceedings of ACM SIGIR, Portland, 2012, pp. 891–900

  16. T. Sakai, Topic set size design. Inf. Retr. 19(3), 256–283 (2016)

  17. T. Sakai, The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation, in Proceedings of ACM SIGIR, Shinjuku, 2017, pp. 25–34

  18. H. Toyoda, (ed.), Fundamentals of Bayesian Statistics: Practical Getting Started by Hamiltonian Monte Carlo Method (in Japanese) (Asakura Shoten, Shinjuku, 2015)

  19. H. Toyoda, An Introduction to Statistical Data Analysis: Bayesian Statistics for ‘post p-value era’ (in Japanese) (Asakura Shoten, Shinjuku, 2016)

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Sakai, T. (2018). Preliminaries. In: Laboratory Experiments in Information Retrieval. The Information Retrieval Series, vol 40. Springer, Singapore. https://doi.org/10.1007/978-981-13-1199-4_1

  • DOI: https://doi.org/10.1007/978-981-13-1199-4_1

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1198-7

  • Online ISBN: 978-981-13-1199-4

  • eBook Packages: Computer Science, Computer Science (R0)
