Topic Set Size Design Using Excel

Laboratory Experiments in Information Retrieval

Part of the book series: The Information Retrieval Series (INRE, volume 40)

Abstract

This chapter discusses topic set size design, which enables test collection builders to determine the number of topics to create based on statistical requirements. First, an overview of five topic set size design methods is provided (Sect. 6.1), followed by details on each method (Sects. 6.2, 6.3, 6.4, 6.5, and 6.6). These methods are based either on a desired statistical power (for the paired t-test, the two-sample t-test, and one-way ANOVA) or on a desired cap on the expected width of the confidence interval of the difference in means for paired and unpaired data. The simple Excel tools that I devised are based on the sample size design techniques described by Nagata [16]. As these methods require an estimate of the population within-system variance for a given evaluation measure (or of the variance of the score differences in the case of paired data), this chapter then describes how the variance can be estimated from pilot data (Sect. 6.7). Finally, it discusses the relationships among the different topic set size design methods (Sect. 6.8).
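
To make the power-based approach concrete, the following is a minimal Python sketch of topic set size design for the paired t-test: it searches for the smallest number of topics n whose achieved power reaches 1 − β, given the significance level α, the minimum detectable difference, and a variance estimate obtained from pilot data. It mirrors the logic behind the Excel tools discussed in this chapter but is an independent illustration, not the tools themselves; the function name and the numbers in the example call are hypothetical.

```python
# A minimal sketch of power-based topic set size design for the paired t-test.
# Assumptions: scipy is available; the tiny rejection probability in the
# opposite tail is ignored, so power = P(T > t_crit) under the noncentral t.
from scipy.stats import nct, t as t_dist


def topic_set_size(alpha, beta, min_delta, var_hat, n_max=10000):
    """Return the smallest n (and its achieved power) such that the paired
    t-test detects a mean difference of min_delta with power >= 1 - beta.

    alpha     -- Type I error probability (two-sided test)
    beta      -- Type II error probability (required power is 1 - beta)
    min_delta -- minimum detectable difference in mean scores
    var_hat   -- estimated variance of the per-topic score differences,
                 e.g. obtained from pilot data
    """
    for n in range(2, n_max + 1):
        df = n - 1
        t_crit = t_dist.ppf(1.0 - alpha / 2.0, df)  # two-sided critical value
        lam = min_delta / (var_hat / n) ** 0.5      # noncentrality parameter
        power = 1.0 - nct.cdf(t_crit, df, lam)      # achieved power at this n
        if power >= 1.0 - beta:
            return n, power
    raise ValueError("required power not reached by n_max topics")


# Hypothetical example: alpha = 0.05, power = 0.80, a minimum detectable
# difference of 0.05 in some evaluation measure, and a pilot-based variance
# estimate of 0.05 for the score differences.
if __name__ == "__main__":
    n, power = topic_set_size(alpha=0.05, beta=0.20, min_delta=0.05, var_hat=0.05)
    print(n, power)
```

The same search-until-power-is-reached structure carries over to the two-sample t-test and one-way ANOVA variants, with the appropriate degrees of freedom and noncentrality parameter substituted.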


Notes

  1. This chapter relies heavily on Nagata’s formula derivations for sample size design [16], but the book is in Japanese. For discussions in English on sample sizes and power analysis, the reader is referred to Ryan [18], Murphy, Myors, and Wolach [15], and Kraemer and Blasey [14].

  2. Gilbert and Sparck Jones [11] (page A4) do report a table that shows the required number of topics as a function of the number of relevant or retrieved documents per topic. For example, if the number of relevant documents per topic is five and we want a 5% Type I error probability and 95% statistical power with the sign test, 830 topics are required according to their analysis.

  3. Precision at document cutoff 10.

  4. http://www.ccs.neu.edu/home/jaa/papers/drafts/statAP.pdf

  5. These tools are slightly easier to use than their earlier versions, samplesizeTTEST.xlsx, samplesizeANOVA.xlsx, and samplesizeCI.xlsx, in that the user no longer needs to scroll down the Excel sheet to find the right topic set size.

  6. The achieved power is computed in Column K, although not shown in Fig. 6.1.

  7. In Corollary 9, let \(\mu = \mu_{1}-\mu_{2}\), \(\sigma^2 = \sigma_{1}^2 + \sigma_{2}^2\), \(\mu_{0}=0\), \(\lambda = \lambda_{t}\).

  8. Recall that with Microsoft Excel, \(z_{inv}(P)\) can be obtained as NORM.S.INV(1 − P).
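     For example, \(z_{inv}(0.025)\) is obtained as NORM.S.INV(0.975) ≈ 1.96.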

  9. This table corrects a typo in Table 1 of Sakai [22] for \((\alpha, \beta, \min\Delta_{t}) = (0.05, 0.20, 1.0)\), and additionally provides the sample sizes for \(\min\Delta_{t} = 1.5, 2.0\).

  10. The achieved power is computed in Column K, although not shown in Fig. 6.2.

  11. An earlier version of this tool, samplesizeANOVA, accommodates only \(\alpha = 0.01, 0.05\) and \(\beta = 0.10, 0.20\) [22].

  12. The achieved power is computed in Column I, although not shown in Fig. 6.3.

  13. Let \(A = \max_{i} a_{i}\) and \(a = \min_{i} a_{i}\). Then \(D^2/2 = (A^2+a^2-2Aa)/2 \leq A^2 + a^2 \leq \sum_{i=1}^{m} a_{i}^2\). The equality holds when \(A = D/2\), \(a = -D/2\), and \(a_{i} = 0\) for all other systems.

  14. Let \(\chi^2\) be a random variable that obeys \(\chi^2(\phi)\). Then \(c^{\ast}\) represents the population mean of the random variable \(\sqrt{\chi^2/\phi}\); that is, \(E(\sqrt{\chi^2/\phi}) = c^{\ast}\). This is the same \(c^{\ast}\) used in Theorem 11 (Chap. 1, Sect. 1.3.1).
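      In closed form, since \(\sqrt{\chi^2}\) obeys a chi distribution with \(\phi\) degrees of freedom, \(c^{\ast} = \sqrt{2/\phi}\,\Gamma((\phi+1)/2)/\Gamma(\phi/2)\) (a standard fact, stated here for reference).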

  15. Recall Corollary 5 (Chap. 1, Sect. 1.2.4): if \(u = \frac{\bar{x}-\mu}{\sqrt{\sigma^2/n}} \sim N(0,1^2)\), then \(t = \frac{\bar{x}-\mu}{\sqrt{V/n}} \sim t(n-1)\), where \(E(V) = \sigma^2\). That is, a t-distribution is like the standard normal distribution, except that there is uncertainty about the estimator of \(\sigma^2\), whose accuracy increases with \(n\).

  16. The covariance of two random variables \(x\) and \(y\) is defined as \(COV(x, y) = E((x - E(x))(y - E(y)))\); note that \(COV(x, x) = V(x)\), i.e. the population variance of \(x\) (see Chap. 1, Sect. 1.2.1). In general, \(V(x - y) = V(x) + V(y) - 2COV(x, y)\) holds. If \(COV(x, y) = 0\), we say that \(x\) and \(y\) are uncorrelated.
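      The identity follows by writing \(V(x - y) = E\{((x - E(x)) - (y - E(y)))^2\}\) and expanding the square, whose cross term yields \(-2COV(x, y)\).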

  17. http://research.nii.ac.jp/ntcir/index-en.html

  18. The high variances of nERR reflect the fact that it is a measure designed primarily for navigational intents: it relies heavily on the first retrieved relevant document, while the other measures also rely on the other retrieved relevant documents.

  19. Start from the left-hand side of Eq. 6.61.

    $$\displaystyle \begin{aligned} n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2\bar{x}(n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet}) + (n_{1}+n_{2})\bar{x}^2 = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - 2N\bar{x}^2 + N\bar{x}^2 \end{aligned}$$
    $$\displaystyle \begin{aligned} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - N \frac{ (n_{1}\bar{x}_{1\bullet} + n_{2}\bar{x}_{2\bullet} )^2}{N^2} = n_{1}\bar{x}_{1\bullet}^2 + n_{2}\bar{x}_{2\bullet}^2 - \frac{ n_{1}^2\bar{x}_{1\bullet}^2 + n_{2}^2\bar{x}_{2\bullet}^2 + 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet} }{N} \end{aligned}$$
    $$\displaystyle \begin{aligned} = \frac{1}{N} ( (n_{1}+n_{2})n_{1}\bar{x}_{1\bullet}^2 + (n_{1}+n_{2})n_{2}\bar{x}_{2\bullet}^2 - n_{1}^2\bar{x}_{1\bullet}^2 - n_{2}^2\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet}) \end{aligned}$$
    $$\displaystyle \begin{aligned} = \frac{1}{N} (n_{1}n_{2}\bar{x}_{1\bullet}^2 + n_{1}n_{2}\bar{x}_{2\bullet}^2 - 2n_{1}n_{2}\bar{x}_{1\bullet}\bar{x}_{2\bullet}) =\frac{n_{1}n_{2}}{N} (\bar{x}_{1\bullet}-\bar{x}_{2\bullet})^2 \ , \end{aligned}$$

    which equals the right-hand side of Eq. 6.61.

References

  1. J. Allan, B. Carterette, J.A. Aslam, V. Pavlu, B. Dachev, E. Kanoulas, Million query track 2007 overview, in Proceedings of TREC 2007, Gaithersburg, 2008

  2. J. Allan, J.A. Aslam, B. Carterette, V. Pavlu, E. Kanoulas, Million query track 2008 overview, in Proceedings of TREC 2008, Gaithersburg, 2009

  3. C. Buckley, E.M. Voorhees, Retrieval system evaluation, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 3 (The MIT Press, Cambridge, MA, 2005), pp. 53–75

  4. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, in Proceedings of ACM ICML, Bonn, 2005, pp. 89–96

  5. B. Carterette, J. Allan, R. Sitaraman, Minimal test collections for retrieval evaluation, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 268–275

  6. B. Carterette, V. Pavlu, E. Kanoulas, J.A. Aslam, J. Allan, Evaluation over thousands of queries, in Proceedings of ACM SIGIR, Singapore, 2008, pp. 651–658

  7. B. Carterette, V. Pavlu, H. Fang, E. Kanoulas, Million query track 2009 overview, in Proceedings of TREC 2009, Gaithersburg, 2010

  8. O. Chapelle, D. Metzler, Y. Zhang, P. Grinspan, Expected reciprocal rank for graded relevance, in Proceedings of ACM CIKM, Hong Kong, 2009, pp. 621–630

  9. C.L.A. Clarke, N. Craswell, I. Soboroff, E.M. Voorhees, Overview of the TREC 2011 web track, in Proceedings of TREC 2011, Gaithersburg, 2012

  10. C.L.A. Clarke, N. Craswell, E.M. Voorhees, Overview of the TREC 2012 web track, in Proceedings of TREC 2012, Gaithersburg, 2013

  11. H. Gilbert, K. Sparck Jones, Statistical bases of relevance assessment for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5481, 1979

  12. D.K. Harman, The TREC test collections, in TREC: Experiment and Evaluation in Information Retrieval, ed. by E.M. Voorhees, D.K. Harman, chapter 2 (The MIT Press, Cambridge, MA, 2005)

  13. K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)

  14. H.C. Kraemer, C. Blasey, How Many Subjects? Statistical Power Analysis in Research, 2nd edn. (SAGE Publications, Los Angeles, 2016)

  15. K.R. Murphy, B. Myors, A. Wolach, Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests, 4th edn. (Routledge, London, 2014)

  16. Y. Nagata, How to Design the Sample Size (in Japanese) (Asakura Shoten, Shinjuku, 2003)

  17. Y. Nagata, M. Yoshida, Introduction to Multiple Comparison Procedures (in Japanese) (Scientist Press, Shibuya, 1997)

  18. T.P. Ryan, Sample Size Determination and Power (Wiley, Chichester, 2013)

  19. T. Sakai, Ranking the NTCIR systems based on multigrade relevance, in Proceedings of AIRS 2004 (LNCS 3411), Beijing, 2004, pp. 251–262

  20. T. Sakai, Evaluating evaluation metrics based on the bootstrap, in Proceedings of ACM SIGIR, Seattle, 2006, pp. 525–532

  21. T. Sakai, Metrics, statistics, tests, in PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), 2014, pp. 116–163

  22. T. Sakai, Topic set size design. Inf. Retr. 19(3), 256–283 (2016)

  23. T. Sakai, Evaluating evaluation measures with worst-case confidence interval widths, in Proceedings of EVIA, Chiyoda, 2017, pp. 16–19

  24. T. Sakai, How to run an evaluation task, in Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF, ed. by N. Ferro, C. Peters, chapter 3 (Springer, 2019)

  25. T. Sakai, L. Shang, On estimating variances for topic set size design, in Proceedings of EVIA, Chiyoda, 2016, pp. 9–12

  26. M. Sanderson, J. Zobel, Information retrieval evaluation: effort, sensitivity, and reliability, in Proceedings of ACM SIGIR, Salvador, 2005, pp. 162–169

  27. K. Sparck Jones, C.J. van Rijsbergen, Report on the need for and provision of an ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5266, 1975

  28. K. Sparck Jones, R.G. Bates, Report on a design study for the ‘ideal’ information retrieval test collection. Technical report, Computer Laboratory, University of Cambridge, British Library Research and Development Report No. 5481, 1977

  29. E.M. Voorhees, Overview of the TREC 2003 robust retrieval track, in Proceedings of TREC 2003, Gaithersburg, 2004

  30. E.M. Voorhees, Overview of the TREC 2004 robust retrieval track, in Proceedings of TREC 2004, Gaithersburg, 2005

  31. E.M. Voorhees, Topic set size redux, in Proceedings of ACM SIGIR, Boston, 2009, pp. 806–807

  32. E.M. Voorhees, C. Buckley, The effect of topic set sizes on retrieval experiment error, in Proceedings of ACM SIGIR, Tampere, 2002, pp. 162–169

  33. W. Webber, A. Moffat, J. Zobel, Statistical power in retrieval experimentation, in Proceedings of ACM CIKM, Napa Valley, 2008, pp. 571–580

  34. J. Zobel, How reliable are the results of large-scale information retrieval experiments? in Proceedings of ACM SIGIR, Melbourne, 1998, pp. 307–314

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Sakai, T. (2018). Topic Set Size Design Using Excel. In: Laboratory Experiments in Information Retrieval. The Information Retrieval Series, vol 40. Springer, Singapore. https://doi.org/10.1007/978-981-13-1199-4_6

  • DOI: https://doi.org/10.1007/978-981-13-1199-4_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1198-7

  • Online ISBN: 978-981-13-1199-4
