Sample Sizes for Query Probing in Uncooperative Distributed Information Retrieval

Shokouhi, Milad; Scholer, Falk; Zobel, Justin

doi:10.1007/11610113_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3841)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

The goal of distributed information retrieval is to support effective searching over multiple document collections. For efficiency, queries should be routed only to those collections that are likely to contain relevant documents, so it is necessary to first obtain information about the content of the target collections. In an uncooperative environment, query probing, in which randomly chosen queries are used to retrieve a sample of the documents and thus of the lexicon, has been proposed as a technique for estimating statistical term distributions. In this paper we rebut the claim that a sample of 300 documents is sufficient to provide good coverage of collection terms. We propose a novel sampling strategy and experimentally demonstrate that sample size needs to vary from collection to collection, that our methods achieve good coverage based on variable-sized samples, and that we can use the results of a probe to determine when to stop sampling.
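
The general idea of query probing with a data-dependent stopping point can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify their sampling strategy or stopping criterion. The `search` function is a hypothetical stand-in for a collection's search interface, and the stopping rule (halt once the last `window` probes together yield fewer than `min_new_terms` unseen terms) and all parameter names are assumptions chosen for illustration.

```python
import random


def tokenize(text):
    """Split a document into lowercase whitespace-delimited terms."""
    return text.lower().split()


def search(collection, term, k=4):
    """Hypothetical search interface: ids of the first k documents containing term."""
    return [i for i, doc in enumerate(collection) if term in tokenize(doc)][:k]


def query_probe(collection, seed_term, window=5, min_new_terms=1,
                max_probes=1000, rng=None):
    """Sample documents by probing with single-term queries.

    Each later probe term is drawn at random from the vocabulary seen so far.
    Assumed stopping rule: stop when the last `window` probes together
    contributed fewer than `min_new_terms` previously unseen terms.
    """
    rng = rng or random.Random(0)
    sampled, vocab, used = set(), set(), set()
    new_per_probe = []
    term = seed_term
    for _ in range(max_probes):
        used.add(term)
        new_terms = 0
        for doc_id in search(collection, term):
            if doc_id not in sampled:
                sampled.add(doc_id)
                terms = set(tokenize(collection[doc_id]))
                new_terms += len(terms - vocab)
                vocab |= terms
        new_per_probe.append(new_terms)
        # Stop once recent probes have stopped revealing new vocabulary.
        if (len(new_per_probe) >= window
                and sum(new_per_probe[-window:]) < min_new_terms):
            break
        candidates = sorted(vocab - used)
        if not candidates:  # every known term has already been used as a probe
            break
        term = rng.choice(candidates)
    return sampled, vocab


if __name__ == "__main__":
    docs = ["alpha beta gamma", "beta delta", "gamma epsilon zeta",
            "eta theta", "alpha eta"]
    sampled, vocab = query_probe(docs, "alpha")
    print(len(sampled), "documents sampled;", len(vocab), "terms seen")
```

Because the loop ends when new vocabulary stops appearing rather than at a fixed document count, the sample size in this sketch depends on the collection being probed, which is the kind of behaviour the abstract argues for.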

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT University, Melbourne, 3001, Australia
Milad Shokouhi, Falk Scholer & Justin Zobel

Editor information

Editors and Affiliations

School of ITEE, The University of Queensland, Australia
Xiaofang Zhou
School of Computer Science and Technology, Heilongjiang University, China
Jianzhong Li
School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD, Australia
Heng Tao Shen
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
Victoria University, Australia
Yanchun Zhang

About this paper

Cite this paper

Shokouhi, M., Scholer, F., Zobel, J. (2006). Sample Sizes for Query Probing in Uncooperative Distributed Information Retrieval. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_7

DOI: https://doi.org/10.1007/11610113_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9
eBook Packages: Computer Science, Computer Science (R0)
