Abstract
This chapter discusses topic set size design, which enables test collection builders to determine the number of topics to create based on statistical requirements. First, an overview of five topic set size design methods is provided (Sect. 6.1), followed by details on each method (Sects. 6.2, 6.3, 6.4, 6.5, and 6.6). These methods are based either on a desired statistical power (for the paired t-test, the two-sample t-test, and one-way ANOVA) or on a desired cap on the expected width of the confidence interval of the difference in means, for paired and unpaired data. The simple Excel tools that I devised are based on the sample size design techniques described in Nagata (How to Design the Sample Size (in Japanese), Asakura Shoten, 2003). As these methods require an estimate of the population within-system variance for a given evaluation measure (or of the variance of the score differences in the case of paired data), this chapter then describes how the variance can be estimated from pilot data (Sect. 6.7). Finally, it discusses the relationships among the different topic set size design methods (Sect. 6.8).
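To make the power-based approach concrete, the following is a minimal Python sketch of topic set size design for the paired t-test; it is not the Excel tools themselves but a standard noncentral-t power computation. Given a significance level α, a required power 1 − β, a minimum detectable mean difference, and a pilot-based estimate of the variance of the per-topic score differences, it searches for the smallest topic set size n whose achieved power reaches 1 − β. The function name and the example values are hypothetical.

```python
# Minimal sketch: power-based topic set size design for the paired t-test.
# Not the Excel tools from this chapter; a textbook noncentral-t computation.
from scipy import stats

def topic_set_size_paired_t(min_delta, sigma2, alpha=0.05, beta=0.20, n_max=10000):
    """Smallest n such that a true mean difference of min_delta is detected
    with probability >= 1 - beta by a two-sided paired t-test at level alpha,
    where sigma2 estimates the variance of the per-topic score differences."""
    for n in range(2, n_max + 1):
        df = n - 1
        t_crit = stats.t.ppf(1 - alpha / 2, df)     # two-sided critical value
        ncp = min_delta / (sigma2 / n) ** 0.5       # noncentrality parameter
        power = 1 - stats.nct.cdf(t_crit, df, ncp)  # lower tail is negligible
        if power >= 1 - beta:
            return n
    return None

# Hypothetical values: detect a mean score difference of 0.05,
# given an estimated variance of 0.04 for the score differences.
print(topic_set_size_paired_t(min_delta=0.05, sigma2=0.04))
```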
Notes
- 1.
- 2.
Gilbert and Sparck Jones [11] (page A4) do report on a table that shows the required number of topics as a function of the number of relevant or retrieved documents per topic. For example, if the number of relevant documents per topic is five and we want 5% Type I error probability and 95% statistical power with the sign test, 830 topics are required according to their analysis.
- 3.
Precision at document cutoff 10.
- 4.
- 5.
These tools are slightly easier to use than their earlier versions, samplesizeTTEST.xlsx, samplesizeANOVA.xlsx, and samplesizeCI.xlsx, in that there is no need for the user to scroll down the Excel sheet to find the right topic set size anymore.
- 6.
The achieved power is computed in Column K, although not shown in Fig. 6.1.
- 7.
In Corollary 9, let \(\mu = \mu_{1}-\mu_{2}\), \(\sigma^2 = \sigma_{1}^2 + \sigma_{2}^2\), \(\mu_{0}=0\), and \(\lambda = \lambda_{t}\).
- 8.
Recall that with Microsoft Excel, \(z_{\mathrm{inv}}(P)\) can be obtained as NORM.S.INV(1 − P).
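For readers working outside Excel, the same critical value can be obtained with SciPy's inverse normal CDF; the helper below merely mirrors the chapter's \(z_{\mathrm{inv}}(P)\) notation and is not a SciPy function.

```python
# z_inv(P): the upper-P critical point of the standard normal distribution,
# i.e. the value returned by Excel's NORM.S.INV(1 - P).
from scipy.stats import norm

def z_inv(P):
    return norm.ppf(1 - P)

print(z_inv(0.05))  # 1.6448..., the one-sided 5% critical value
```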
- 9.
This table corrects a typo in Table 1 of Sakai [22] for \((\alpha, \beta, \min\Delta_t) = (0.05, 0.20, 1.0)\), and provides the sample sizes for \(\min\Delta_t = 1.5, 2.0\) in addition.
- 10.
The achieved power is computed in Column K, although not shown in Fig. 6.2.
- 11.
An earlier version of this tool, samplesizeANOVA, accommodates only α = 0.01, 0.05 and β = 0.10, 0.20 [22].
- 12.
The achieved power is computed in Column I, although not shown in Fig. 6.3.
- 13.
Let \(A = \max_i a_i\) and \(a = \min_i a_i\). Then \(D^2/2 = (A^2 + a^2 - 2Aa)/2 \leq A^2 + a^2 \leq \sum_{i=1}^{m} a_i^2\). The equality holds when \(A = D/2\), \(a = -D/2\), and \(a_i = 0\) for all other systems.
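A quick numeric spot-check of this bound, using randomly generated (hypothetical) system effects \(a_i\):

```python
# Check D^2/2 <= sum_i a_i^2 for random system effects, where
# D = max_i a_i - min_i a_i; then confirm the stated equality case.
import random

for _ in range(1000):
    a = [random.uniform(-1.0, 1.0) for _ in range(5)]
    D = max(a) - min(a)
    assert D**2 / 2 <= sum(v * v for v in a) + 1e-12

a = [0.5, -0.5, 0.0, 0.0]     # A = D/2, a = -D/2, all others zero
D = max(a) - min(a)
print(D**2 / 2, sum(v * v for v in a))  # both 0.5: the bound is tight
```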
- 14.
- 15.
Recall Corollary 5 (Chap. 1 Sect. 1.2.4): if \(u = \frac{\bar{x}-\mu}{\sqrt{\sigma^2/n}} \sim N(0,1^2)\), then \(t = \frac{\bar{x}-\mu}{\sqrt{V/n}} \sim t(n-1)\), where \(E(V) = \sigma^2\). That is, a t-distribution is like the standard normal distribution, except that there is uncertainty about the estimator of \(\sigma^2\), whose accuracy increases with n.
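The heavier tails of the t-distribution are easy to see by simulation; in the sketch below, the sample size n = 5 and the number of replicates are arbitrary choices. Each sample mean is standardised with the sample variance, and the resulting tail probability is compared against N(0, 1²) and t(n − 1):

```python
# Simulate Corollary 5: (xbar - mu)/sqrt(V/n) follows t(n-1), not N(0,1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5
x = rng.normal(loc=0.0, scale=1.0, size=(100_000, n))
t_vals = (x.mean(axis=1) - 0.0) / np.sqrt(x.var(axis=1, ddof=1) / n)

print((np.abs(t_vals) > 2.0).mean())      # empirical two-sided tail
print(2 * stats.t.sf(2.0, n - 1))         # t(4) tail: about 0.116
print(2 * stats.norm.sf(2.0))             # N(0,1) tail: about 0.046
```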
- 16.
The covariance of two random variables x and y is defined as COV(x, y) = E((x − E(x))(y − E(y))); note that COV(x, x) = V(x), i.e. the population variance of x (see Chap. 1 Sect. 1.2.1). In general, V(x − y) = V(x) + V(y) − 2COV(x, y) holds. If COV(x, y) = 0, we say that x and y are uncorrelated; in that case, V(x − y) = V(x) + V(y).
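The identity is easy to confirm numerically; the data below are hypothetical correlated per-topic scores, standing in for two systems evaluated on the same topics:

```python
# Check V(x - y) = V(x) + V(y) - 2 COV(x, y) on correlated samples.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.5, 0.1, 1000)
y = 0.7 * x + rng.normal(0.0, 0.05, 1000)   # y is correlated with x

c = np.cov(x, y, ddof=1)                    # 2x2 sample covariance matrix
lhs = np.var(x - y, ddof=1)
rhs = c[0, 0] + c[1, 1] - 2 * c[0, 1]
print(lhs, rhs)                             # agree up to floating point
```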
- 17.
- 18.
The high variances of nERR reflect the fact that it is a measure designed primarily for navigational intents: it relies heavily on the first retrieved relevant document, whereas the other measures also rely on the relevant documents retrieved below it.
- 19.
Start from the left-hand side of Eq. 6.61:
$$\displaystyle \begin{aligned} n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2\bar{x}(n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet}) + (n_{1}+n_{2})\bar{x}^2 = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2N\bar{x}^2 + N\bar{x}^2 \end{aligned}$$
$$\displaystyle \begin{aligned} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - N \frac{ (n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet} )^2}{N^2} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - \frac{ n_{1}^2\bar{x}_{1\bullet}^2 + n_{2}^2\bar{x}_{2\bullet}^2 + 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet} }{N} \end{aligned}$$
$$\displaystyle \begin{aligned} = \frac{1}{N} \left( (n_{1}+n_{2})n_{1}\bar{x}_{1\bullet}^2 + (n_{1}+n_{2})n_{2}\bar{x}_{2\bullet}^2 - n_{1}^2\bar{x}_{1\bullet}^2 - n_{2}^2\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet} \right) \end{aligned}$$
$$\displaystyle \begin{aligned} = \frac{1}{N} \left( n_{1}n_{2}\bar{x}_{1\bullet}^2 + n_{1}n_{2}\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet} \right) = \frac{n_{1}n_{2}}{N} (\bar{x}_{1\bullet}-\bar{x}_{2\bullet})^2 \ , \end{aligned}$$
which equals the right-hand side of Eq. 6.61. Here the first step uses \(n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet} = N\bar{x}\) and \(N = n_{1} + n_{2}\).
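The identity can also be verified numerically; the group sizes and means below are hypothetical:

```python
# Numeric check of the identity in Note 19 (Eq. 6.61):
# n1*m1^2 + n2*m2^2 - N*xbar^2 == (n1*n2/N) * (m1 - m2)^2.
n1, n2 = 30, 50          # hypothetical group sizes
m1, m2 = 0.42, 0.35      # hypothetical group means
N = n1 + n2
xbar = (n1 * m1 + n2 * m2) / N   # grand mean

lhs = n1 * m1**2 + n2 * m2**2 - N * xbar**2
rhs = (n1 * n2 / N) * (m1 - m2)**2
print(lhs, rhs)          # identical up to floating point
```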
References
J. Allan, B. Carterette, J.A. Aslam, V. Pavlu, B. Dachev, E. Kanoulas, Million query track 2007 overview, in Proceedings of TREC 2007, Gaithersburg, 2008
J. Allan, J.A. Aslam, B. Carterette, V. Pavlu, E. Kanoulas, Million query track 2008 overview, in Proceedings of TREC 2008, Gaithersburg, 2009
C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 3, pp. 53–75 (The MIT Press, Cambridge, MA, 2005)
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in Proceedings of ACM ICML, Bonn, 2005, pp. 89–96
B. Carterette, J. Allan, R. Sitaraman, Minimal test collections for retrieval evaluation, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 268–275
B. Carterette, V. Pavlu, E. Kanoulas, J.A. Aslam, J. Allan, Evaluation over thousands of queries, in Proceedings of ACM SIGIR, Singapore, 2008, pp. 651–658
B. Carterette, V. Pavlu, H. Fang, E. Kanoulas, Million query track 2009 overview, in Proceedings of TREC 2009, Gaithersburg, 2010
O. Chapelle, D. Metzler, Y. Zhang, P. Grinspan, Expected reciprocal rank for graded relevance, in Proceedings of ACM CIKM, Hong Kong, 2009, pp. 621–630
C.L.A. Clarke, N. Craswell, I. Soboroff, E.M. Voorhees, Overview of the TREC 2011 web track, in Proceedings of TREC 2011, Gaithersburg, 2012
C.L.A. Clarke, N. Craswell, E.M. Voorhees, Overview of the TREC 2012 web track, in Proceedings of TREC 2012, Gaithersburg, 2013
H. Gilbert, K. Sparck Jones, Statistical bases of relevance assessment for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5481, 1979
D.K. Harman, The TREC test collections, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 2 (The MIT Press, Cambridge, MA, 2005)
K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)
H.C. Kraemer, C. Blasey, How Many Subjects? Statistical Power Analysis in Research, 2nd edn. (SAGE Publications, Los Angeles, 2016)
K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests, 4th edn. (Routledge, London, 2014)
Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)
Y. Nagata, M. Yoshida, Introduction to Multiple Comparison Procedures (in Japanese) (Scientist Press, Shibuya, 1997)
T.P. Ryan, Sample Size Determination and Power (Wiley, Chichester, 2013)
T. Sakai, Ranking the NTCIR systems based on multigrade relevance, in Proceedings of AIRS 2004, Beijing. LNCS 3411, 2004, pp. 251–262
T. Sakai, Evaluating evaluation metrics based on the bootstrap, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 525–532
T. Sakai, Metrics, statistics, tests, in PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), 2014, pp. 116–163
T. Sakai, Topic set size design. Inf. Retr. 19(3), 256–283 (2016)
T. Sakai, Evaluating evaluation measures with worst-case confidence interval widths, in Proceedings of EVIA, Chiyoda, 2017, pp. 16–19
T. Sakai, How to run an evaluation task, in Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF, ed. by N. Ferro, C. Peters, chapter 3 (Springer, 2019)
T. Sakai, L. Shang, On estimating variances for topic set size design, in Proceedings of EVIA, Chiyoda, 2016, pp. 9–12
M. Sanderson, J. Zobel, Information retrieval evaluation: effort, sensitivity, and reliability, in Proceedings of ACM SIGIR, Salvador, 2005, pp. 162–169
K. Sparck Jones, C.J. van Rijsbergen, Report on the need for and provision of an ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5266, 1975
K. Sparck Jones, R.G. Bates, Report on a design study for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5428, 1977
E.M. Voorhees, Overview of the TREC 2003 robust retrieval track, in Proceedings of TREC 2003, Gaithersburg, 2004
E.M. Voorhees, Overview of the TREC 2004 robust retrieval track, in Proceedings of TREC 2004, Gaithersburg, 2005
E.M. Voorhees, Topic set size redux, in Proceedings of ACM SIGIR, Boston, 2009, pp. 806–807
E.M. Voorhees, C. Buckley, The effect of topic set sizes on retrieval experiment error, in Proceedings of ACM SIGIR, Tampere, 2002, pp. 162–169
W. Webber, A. Moffat, J. Zobel, Statistical power in retrieval experimentation, in Proceedings of ACM CIKM, Napa Valley, 2008, pp. 571–580
J. Zobel, How reliable are the results of large-scale information retrieval experiments? in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 307–314