
The Reusability of a Diversified Search Test Collection

Conference paper
Information Retrieval Technology (AIRS 2012)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7675)


Abstract

Traditional ad hoc IR test collections were built using a relatively large pool depth (e.g. 100) and are usually assumed to be reusable. Moreover, when they are reused to compare a new system with another new system or with systems that contributed to the pools (“contributors”), an even larger measurement depth (e.g. 1,000) is often used for computing evaluation metrics. In contrast, the web diversity test collections created in the past few years at TREC and NTCIR use a much smaller pool depth (e.g. 20). The measurement depth is also small (e.g. 10–30), as search result diversification is primarily intended for the first result page. In this study, we examine the reusability of a typical web diversity test collection, namely, one from the NTCIR-9 INTENT-1 Chinese Document Ranking task, which used a pool depth of 20 and official measurement depths of 10, 20 and 30. First, we conducted additional relevance assessments to expand the official INTENT-1 collection to a pool depth of 40. Using the expanded relevance assessments, we show that run rankings at a measurement depth of 30 are too unreliable, given that the pool depth is only 20. Second, we conducted a leave-one-out experiment for every participating team of the INTENT-1 Chinese task, to examine how (un)fairly new runs are evaluated with the INTENT-1 collection. We show that, for the purpose of comparing new systems with the contributors of the test collection being used, condensed-list versions of existing diversity evaluation metrics are more reliable than the raw metrics. However, even the condensed-list metrics may be unreliable if the new systems are not competitive with the contributors.
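
The abstract refers to two methodological ingredients: leave-one-out pooling (rebuilding the judgment pool with all runs from one team withheld) and condensed-list metrics (removing unjudged documents from a ranked list before scoring). The sketch below is a rough illustration of both ideas, not the authors' code: it uses plain nDCG for brevity rather than the official diversity metrics (e.g. D#-nDCG), and the data structures, document IDs, and function names are illustrative assumptions.

```python
# Minimal sketch of leave-one-out pooling and condensed-list evaluation.
# Uses plain nDCG instead of the INTENT task's diversity metrics; all names
# and example data are hypothetical.

from math import log2


def build_pool(runs_by_team, held_out_team=None, pool_depth=20):
    """Union of the top-`pool_depth` documents of every run, optionally
    excluding all runs from one team (leave-one-out)."""
    pool = set()
    for team, runs in runs_by_team.items():
        if team == held_out_team:
            continue
        for ranked_docs in runs:
            pool.update(ranked_docs[:pool_depth])
    return pool


def condense(ranked_docs, judged_docs):
    """Condensed-list transform: drop unjudged documents, preserving order."""
    return [d for d in ranked_docs if d in judged_docs]


def ndcg_at_k(ranked_docs, qrels, k=10):
    """nDCG@k with graded gains in `qrels` (doc -> gain); unjudged documents
    contribute zero gain in the raw (uncondensed) case."""
    gains = [qrels.get(d, 0) for d in ranked_docs[:k]]
    dcg = sum(g / log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Example: a hypothetical new run scored against judgments whose pool it
# never contributed to. d9 and d8 were never pooled, hence never judged.
runs_by_team = {
    "teamA": [["d1", "d2", "d3", "d4"]],
    "teamB": [["d2", "d5", "d6", "d1"]],
}
qrels = {"d1": 2, "d2": 1, "d5": 1}
new_run = ["d9", "d1", "d8", "d2", "d5"]

raw_score = ndcg_at_k(new_run, qrels, k=3)
condensed_score = ndcg_at_k(condense(new_run, set(qrels)), qrels, k=3)
print(raw_score, condensed_score)  # the raw metric penalises unjudged docs
```

In this toy example the raw metric treats the unjudged documents d9 and d8 as non-relevant and the new run scores poorly, whereas the condensed list skips them and the run is scored only on what was actually judged; this is the bias that the paper's leave-one-out experiments quantify.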






Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sakai, T., Dou, Z., Song, R., Kando, N. (2012). The Reusability of a Diversified Search Test Collection. In: Hou, Y., Nie, JY., Sun, L., Wang, B., Zhang, P. (eds) Information Retrieval Technology. AIRS 2012. Lecture Notes in Computer Science, vol 7675. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35341-3_3


  • DOI: https://doi.org/10.1007/978-3-642-35341-3_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35340-6

  • Online ISBN: 978-3-642-35341-3

  • eBook Packages: Computer Science (R0)
