Abstract
This chapter discusses topic set size design, which enables test collection builders to determine the number of topics to create based on statistical requirements. First, an overview of five topic set size design methods is provided (Sect. 6.1), followed by details on each method (Sects. 6.2, 6.3, 6.4, 6.5, and 6.6). These methods are based either on a desired statistical power (for the paired t-test, the two-sample t-test, and one-way ANOVA) or on a desired cap on the expected width of the confidence interval of the difference in means, for paired and unpaired data. The simple Excel tools that I devised are based on the sample size design techniques described in Nagata (How to Design the Sample Size (in Japanese), Asakura Shoten, 2003). As these methods require an estimate of the population within-system variance for a given evaluation measure (or of the variance of the score differences in the case of paired data), this chapter then describes how the variance can be estimated from pilot data (Sect. 6.7). Finally, it discusses the relationships among the different topic set size design methods (Sect. 6.8).
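To make the power-based approach concrete, the following is a minimal Python sketch of topic set size design for the paired t-test; it is not the Excel tools themselves but a standard noncentral-t power computation. Given a significance level α, a required power 1 − β, a minimum detectable mean difference, and a pilot-based estimate of the variance of the per-topic score differences, it searches for the smallest topic set size n whose achieved power reaches 1 − β. The function name and the example values are hypothetical.

```python
# Minimal sketch: power-based topic set size design for the paired t-test.
# Not the Excel tools from this chapter; a textbook noncentral-t computation.
from scipy import stats

def topic_set_size_paired_t(min_delta, sigma2, alpha=0.05, beta=0.20, n_max=10000):
    """Smallest n such that a true mean difference of min_delta is detected
    with probability >= 1 - beta by a two-sided paired t-test at level alpha,
    where sigma2 estimates the variance of the per-topic score differences."""
    for n in range(2, n_max + 1):
        df = n - 1
        t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-sided critical value
        ncp = min_delta / (sigma2 / n) ** 0.5       # noncentrality parameter
        power = 1 - stats.nct.cdf(t_crit, df, ncp)  # lower tail is negligible
        if power >= 1 - beta:
            return n
    return None

# Hypothetical values: detect a mean score difference of 0.05,
# given an estimated variance of 0.04 for the score differences.
print(topic_set_size_paired_t(min_delta=0.05, sigma2=0.04))
```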
Notes
- 1.
- 2.
Gilbert and Sparck Jones [11] (page A4) do report on a table that shows the required number of topics as a function of the number of relevant or retrieved documents per topic. For example, if the number of relevant documents per topic is five and we want 5% Type I error probability and 95% statistical power with the sign test, 830 topics are required according to their analysis.
- 3.
Precision at document cutoff 10.
- 4.
- 5.
These tools are slightly easier to use than their earlier versions, samplesizeTTEST.xlsx, samplesizeANOVA.xlsx, and samplesizeCI.xlsx, in that there is no need for the user to scroll down the Excel sheet to find the right topic set size anymore.
- 6.
The achieved power is computed in Column K, although not shown in Fig. 6.1.
- 7.
In Corollary 9, let \(\mu = \mu_{1}-\mu_{2}\), \(\sigma^2 = \sigma_{1}^2 + \sigma_{2}^2\), \(\mu_{0}=0\), and \(\lambda = \lambda_{t}\).
- 8.
Recall that with Microsoft Excel, \(z_{\mathrm{inv}}(P)\) can be obtained as NORM.S.INV(1 − P).
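For readers working outside Excel, the same critical value can be obtained with SciPy's inverse normal CDF; the helper below merely mirrors the chapter's \(z_{\mathrm{inv}}(P)\) notation and is not a SciPy function.

```python
# z_inv(P): the upper-P critical point of the standard normal distribution,
# i.e. the value returned by Excel's NORM.S.INV(1 - P).
from scipy.stats import norm

def z_inv(P):
    return norm.ppf(1 - P)

print(z_inv(0.05))  # 1.6448..., the one-sided 5% critical value
```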
- 9.
This table corrects a typo in Table 1 of Sakai [22] for \((\alpha, \beta, \min\Delta_t) = (0.05, 0.20, 1.0)\), and provides the sample sizes for \(\min\Delta_t = 1.5, 2.0\) in addition.
- 10.
The achieved power is computed in Column K, although not shown in Fig. 6.2.
- 11.
An earlier version of this tool, samplesizeANOVA, accommodates only α = 0.01, 0.05 and β = 0.10, 0.20 [22].
- 12.
The achieved power is computed in Column I, although not shown in Fig. 6.3.
- 13.
Let \(A = \max_i a_i\) and \(a = \min_i a_i\). Then \(D^2/2 = (A^2 + a^2 - 2Aa)/2 \leq A^2 + a^2 \leq \sum_{i=1}^{m} a_i^2\). The equality holds when \(A = D/2\), \(a = -D/2\), and \(a_i = 0\) for all other systems.
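A quick numeric spot-check of this bound, using randomly generated (hypothetical) system effects \(a_i\):

```python
# Check D^2/2 <= sum_i a_i^2 for random system effects, where
# D = max_i a_i - min_i a_i; then confirm the stated equality case.
import random

for _ in range(1000):
    a = [random.uniform(-1.0, 1.0) for _ in range(5)]
    D = max(a) - min(a)
    assert D**2 / 2 <= sum(v * v for v in a) + 1e-12

a = [0.5, -0.5, 0.0, 0.0]     # A = D/2, a = -D/2, all others zero
D = max(a) - min(a)
print(D**2 / 2, sum(v * v for v in a))  # both 0.5: the bound is tight
```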
- 14.
- 15.
Recall Corollary 5 (Chap. 1 Sect. 1.2.4): if \(u = \frac{\bar{x}-\mu}{\sqrt{\sigma^2/n}} \sim N(0,1^2)\), then \(t = \frac{\bar{x}-\mu}{\sqrt{V/n}} \sim t(n-1)\), where \(E(V) = \sigma^2\). That is, a t-distribution is like the standard normal distribution, except that there is uncertainty about the estimator of \(\sigma^2\), whose accuracy increases with n.
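The heavier tails of the t-distribution are easy to see by simulation; in the sketch below, the sample size n = 5 and the number of replicates are arbitrary choices. Each sample mean is standardised with the sample variance, and the resulting tail probability is compared against N(0, 1²) and t(n − 1):

```python
# Simulate Corollary 5: (xbar - mu)/sqrt(V/n) follows t(n-1), not N(0,1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5
x = rng.normal(loc=0.0, scale=1.0, size=(100_000, n))
t_vals = (x.mean(axis=1) - 0.0) / np.sqrt(x.var(axis=1, ddof=1) / n)

print((np.abs(t_vals) > 2.0).mean())      # empirical two-sided tail
print(2 * stats.t.sf(2.0, n - 1))         # t(4) tail: about 0.116
print(2 * stats.norm.sf(2.0))             # N(0,1) tail: about 0.046
```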
- 16.
The covariance of two random variables x and y is defined as COV(x, y) = E((x − E(x))(y − E(y))); note that COV(x, x) = V(x), i.e. the population variance of x (see Chap. 1 Sect. 1.2.1). In general, V(x − y) = V(x) + V(y) − 2COV(x, y) holds. If COV(x, y) = 0, we say that x and y are uncorrelated; in that case, V(x − y) = V(x) + V(y).
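The identity is easy to confirm numerically; the data below are hypothetical correlated per-topic scores, standing in for two systems evaluated on the same topics:

```python
# Check V(x - y) = V(x) + V(y) - 2 COV(x, y) on correlated samples.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.5, 0.1, 1000)
y = 0.7 * x + rng.normal(0.0, 0.05, 1000)   # y is correlated with x

c = np.cov(x, y, ddof=1)                    # 2x2 sample covariance matrix
lhs = np.var(x - y, ddof=1)
rhs = c[0, 0] + c[1, 1] - 2 * c[0, 1]
print(lhs, rhs)                             # agree up to floating point
```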
- 17.
- 18.
The high variances of nERR reflect the fact that it is a measure designed primarily for navigational intents: it relies heavily on the first retrieved relevant document, whereas the other measures also rely on the relevant documents retrieved below it.
- 19.
Start from the left-hand side of Eq. 6.61:
$$\displaystyle \begin{aligned} n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2\bar{x}(n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet}) + (n_{1}+n_{2})\bar{x}^2 = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2N\bar{x}^2 + N\bar{x}^2 \end{aligned}$$
$$\displaystyle \begin{aligned} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - N \frac{ (n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet} )^2}{N^2} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - \frac{ n_{1}^2\bar{x}_{1\bullet}^2 + n_{2}^2\bar{x}_{2\bullet}^2 + 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet} }{N} \end{aligned}$$
$$\displaystyle \begin{aligned} = \frac{1}{N} \left( (n_{1}+n_{2})n_{1}\bar{x}_{1\bullet}^2 + (n_{1}+n_{2})n_{2}\bar{x}_{2\bullet}^2 - n_{1}^2\bar{x}_{1\bullet}^2 - n_{2}^2\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet} \right) \end{aligned}$$
$$\displaystyle \begin{aligned} = \frac{1}{N} \left( n_{1}n_{2}\bar{x}_{1\bullet}^2 + n_{1}n_{2}\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet} \right) = \frac{n_{1}n_{2}}{N} (\bar{x}_{1\bullet}-\bar{x}_{2\bullet})^2 \ , \end{aligned}$$
which equals the right-hand side of Eq. 6.61. Here the first step uses \(n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet} = N\bar{x}\) and \(N = n_{1} + n_{2}\).
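The identity can also be verified numerically; the group sizes and means below are hypothetical:

```python
# Numeric check of the identity in Note 19 (Eq. 6.61):
# n1*m1^2 + n2*m2^2 - N*xbar^2 == (n1*n2/N) * (m1 - m2)^2.
n1, n2 = 30, 50          # hypothetical group sizes
m1, m2 = 0.42, 0.35      # hypothetical group means
N = n1 + n2
xbar = (n1 * m1 + n2 * m2) / N   # grand mean

lhs = n1 * m1**2 + n2 * m2**2 - N * xbar**2
rhs = (n1 * n2 / N) * (m1 - m2)**2
print(lhs, rhs)          # identical up to floating point
```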
References
J. Allan, B. Carterette, J.A. Aslam, V. Pavlu, B. Dachev, E. Kanoulas, Million query track 2007 overview, in Proceedings of TREC 2007, Gaithersburg, 2008
J. Allan, J.A. Aslam, B. Carterette, V. Pavlu, E. Kanoulas, Million query track 2008 overview, in Proceedings of TREC 2008, Gaithersburg, 2009
C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 3, pp. 53–75 (The MIT Press, Cambridge, MA, 2005)
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in Proceedings of ACM ICML, Bonn, 2005, pp. 89–96
B. Carterette, J. Allan, R. Sitaraman, Minimal test collections for retrieval evaluation, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 268–275
B. Carterette, V. Pavlu, E. Kanoulas, J.A. Aslam, J. Allan, Evaluation over thousands of queries, in Proceedings of ACM SIGIR, Singapore, 2008, pp. 651–658
B. Carterette, V. Pavlu, H. Fang, E. Kanoulas, Million query track 2009 overview, in Proceedings of TREC 2009, Gaithersburg, 2010
O. Chapelle, D. Metzler, Y. Zhang, P. Grinspan, Expected reciprocal rank for graded relevance, in Proceedings of ACM CIKM, Hong Kong, 2009, pp. 621–630
C.L.A. Clarke, N. Craswell, I. Soboroff, E.M. Voorhees, Overview of the TREC 2011 web track, in Proceedings of TREC 2011, Gaithersburg, 2012
C.L.A. Clarke, N. Craswell, E.M. Voorhees, Overview of the TREC 2012 web track, in Proceedings of TREC 2012, Gaithersburg, 2013
H. Gilbert, K. Sparck Jones, Statistical bases of relevance assessment for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5481, 1979
D.K. Harman, The TREC test collections, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 2 (The MIT Press, Cambridge, MA, 2005)
K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)
H.C. Kraemer, C. Blasey, How Many Subjects? Statistical Power Analysis in Research, 2nd edn. (SAGE Publications, Los Angeles, 2016)
K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests, 4th edn. (Routledge, London, 2014)
Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)
Y. Nagata, M. Yoshida, Introduction to Multiple Comparison Procedures (in Japanese) (Scientist Press, Shibuya, 1997)
T.P. Ryan, Sample Size Determination and Power (Wiley, Chichester, 2013)
T. Sakai, Ranking the NTCIR systems based on multigrade relevance, in Proceedings of AIRS 2004, Beijing. LNCS 3411, 2004, pp. 251–262
T. Sakai, Evaluating evaluation metrics based on the bootstrap, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 525–532
T. Sakai, Metrics, statistics, tests, in PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), 2014, pp. 116–163
T. Sakai, Topic set size design. Inf. Retr. 19(3), 256–283 (2016)
T. Sakai, Evaluating evaluation measures with worst-case confidence interval widths, in Proceedings of EVIA, Chiyoda, 2017, pp. 16–19
T. Sakai, How to run an evaluation task, in Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF, ed. by N. Ferro, C. Peters, chapter 3 (Springer, 2019)
T. Sakai, L. Shang, On estimating variances for topic set size design, in Proceedings of EVIA, Chiyoda, 2016, pp. 9–12
M. Sanderson, J. Zobel, Information retrieval evaluation: effort, sensitivity, and reliability, in Proceedings of ACM SIGIR, Salvador, 2005, pp. 162–169
K. Sparck Jones, C.J. van Rijsbergen, Report on the need for and provision of an ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5266, 1975
K. Sparck Jones, R.G. Bates, Report on a design study for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5428, 1977
E.M. Voorhees, Overview of the TREC 2003 robust retrieval track, in Proceedings of TREC 2003, Gaithersburg, 2004
E.M. Voorhees, Overview of the TREC 2004 robust retrieval track, in Proceedings of TREC 2004, Gaithersburg, 2005
E.M. Voorhees, Topic set size redux, in Proceedings of ACM SIGIR, Boston, 2009, pp. 806–807
E.M. Voorhees, C. Buckley, The effect of topic set sizes on retrieval experiment error, in Proceedings of ACM SIGIR, Tampere, 2002, pp. 162–169
W. Webber, A. Moffat, J. Zobel, Statistical power in retrieval experimentation, in Proceedings of ACM CIKM, Napa Valley, 2008, pp. 571–580
J. Zobel, How reliable are the results of large-scale information retrieval experiments? in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 307–314