
The Reusability of a Diversified Search Test Collection

Conference paper
Information Retrieval Technology (AIRS 2012)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7675)


Abstract

Traditional ad hoc IR test collections were built using a relatively large pool depth (e.g. 100) and are usually assumed to be reusable. Moreover, when they are reused to compare a new system with another new system or with systems that contributed to the pools (“contributors”), an even larger measurement depth (e.g. 1,000) is often used for computing evaluation metrics. In contrast, the web diversity test collections created in the past few years at TREC and NTCIR use a much smaller pool depth (e.g. 20). The measurement depth is also small (e.g. 10–30), as search result diversification is primarily intended for the first result page. In this study, we examine the reusability of a typical web diversity test collection, namely, one from the NTCIR-9 INTENT-1 Chinese Document Ranking task, which used a pool depth of 20 and official measurement depths of 10, 20 and 30. First, we conducted additional relevance assessments to expand the official INTENT-1 collection to a pool depth of 40. Using the expanded relevance assessments, we show that run rankings at a measurement depth of 30 are too unreliable, given that the pool depth is only 20. Second, we conducted a leave-one-out experiment for every participating team of the INTENT-1 Chinese task, to examine how (un)fairly new runs are evaluated with the INTENT-1 collection. We show that, for the purpose of comparing new systems with the contributors of the test collection being used, condensed-list versions of existing diversity evaluation metrics are more reliable than the raw metrics. However, even the condensed-list metrics may be unreliable if the new systems are not competitive with the contributors.
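
The abstract refers to two methodological ingredients: leave-one-out pooling (rebuilding the judgment pool with all runs from one team withheld) and condensed-list metrics (removing unjudged documents from a ranked list before scoring). The sketch below is a rough illustration of both ideas, not the authors' code: it uses plain nDCG for brevity rather than the official diversity metrics (e.g. D#-nDCG), and the data structures, document IDs, and function names are illustrative assumptions.

```python
# Minimal sketch of leave-one-out pooling and condensed-list evaluation.
# Uses plain nDCG instead of the INTENT task's diversity metrics; all names
# and example data are hypothetical.

from math import log2


def build_pool(runs_by_team, held_out_team=None, pool_depth=20):
    """Union of the top-`pool_depth` documents of every run, optionally
    excluding all runs from one team (leave-one-out)."""
    pool = set()
    for team, runs in runs_by_team.items():
        if team == held_out_team:
            continue
        for ranked_docs in runs:
            pool.update(ranked_docs[:pool_depth])
    return pool


def condense(ranked_docs, judged_docs):
    """Condensed-list transform: drop unjudged documents, preserving order."""
    return [d for d in ranked_docs if d in judged_docs]


def ndcg_at_k(ranked_docs, qrels, k=10):
    """nDCG@k with graded gains in `qrels` (doc -> gain); unjudged documents
    contribute zero gain in the raw (uncondensed) case."""
    gains = [qrels.get(d, 0) for d in ranked_docs[:k]]
    dcg = sum(g / log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Example: a hypothetical new run scored against judgments whose pool it
# never contributed to. d9 and d8 were never pooled, hence never judged.
runs_by_team = {
    "teamA": [["d1", "d2", "d3", "d4"]],
    "teamB": [["d2", "d5", "d6", "d1"]],
}
qrels = {"d1": 2, "d2": 1, "d5": 1}
new_run = ["d9", "d1", "d8", "d2", "d5"]

raw_score = ndcg_at_k(new_run, qrels, k=3)
condensed_score = ndcg_at_k(condense(new_run, set(qrels)), qrels, k=3)
print(raw_score, condensed_score)  # the raw metric penalises unjudged docs
```

In this toy example the raw metric treats the unjudged documents d9 and d8 as non-relevant and the new run scores poorly, whereas the condensed list skips them and the run is scored only on what was actually judged; this is the bias that the paper's leave-one-out experiments quantify.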






Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sakai, T., Dou, Z., Song, R., Kando, N. (2012). The Reusability of a Diversified Search Test Collection. In: Hou, Y., Nie, JY., Sun, L., Wang, B., Zhang, P. (eds) Information Retrieval Technology. AIRS 2012. Lecture Notes in Computer Science, vol 7675. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35341-3_3


  • DOI: https://doi.org/10.1007/978-3-642-35341-3_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35340-6

  • Online ISBN: 978-3-642-35341-3

  • eBook Packages: Computer Science (R0)
