Sample Sizes for Query Probing in Uncooperative Distributed Information Retrieval

  • Conference paper
Frontiers of WWW Research and Development - APWeb 2006 (APWeb 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3841)

Abstract

The goal of distributed information retrieval is to support effective searching over multiple document collections. For efficiency, queries should be routed only to those collections that are likely to contain relevant documents, so it is necessary to first obtain information about the content of the target collections. In an uncooperative environment, query probing, in which randomly chosen queries are used to retrieve a sample of the documents and thus of the lexicon, has been proposed as a technique for estimating statistical term distributions. In this paper we rebut the claim that a sample of 300 documents is sufficient to provide good coverage of collection terms. We propose a novel sampling strategy and experimentally demonstrate that sample size needs to vary from collection to collection, that our methods achieve good coverage based on variable-sized samples, and that we can use the results of a probe to determine when to stop sampling.
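As a rough illustration of the idea summarised above, the following Python sketch probes a toy collection with randomly chosen single-term queries and stops once recently sampled documents contribute few new terms. It is not the authors' method: the stopping rule, the helper names (`make_search`, `probe_sample`), and the parameters (`window`, `new_terms_threshold`, `max_probes`, `k`) are all assumptions made for this example, and the in-memory list stands in for a real search interface.

```python
import random
from typing import Callable, List, Optional, Set, Tuple

Doc = Tuple[int, str]  # (document id, document text)


def make_search(docs: List[str], k: int = 4) -> Callable[[str], List[Doc]]:
    """Hypothetical stand-in for an uncooperative collection's search
    interface: returns at most k documents containing the query term."""
    def search(term: str) -> List[Doc]:
        hits = [(i, d) for i, d in enumerate(docs) if term in d.split()]
        return hits[:k]
    return search


def probe_sample(
    search: Callable[[str], List[Doc]],
    seed_term: str,
    new_terms_threshold: float = 0.5,  # assumed stopping threshold
    window: int = 5,                   # assumed number of recent documents
    max_probes: int = 200,             # hard cap on the number of queries
    rng: Optional[random.Random] = None,
) -> Tuple[Set[int], Set[str]]:
    """Sample documents by probing with random terms from the vocabulary
    learned so far. Stop when the last `window` sampled documents added,
    on average, fewer than `new_terms_threshold` unseen terms each."""
    rng = rng or random.Random(0)
    vocab: Set[str] = {seed_term}
    seen_ids: Set[int] = set()
    recent: List[int] = []  # new-term count for each sampled document

    for _ in range(max_probes):
        term = rng.choice(sorted(vocab))  # sorted for reproducibility
        for doc_id, text in search(term):
            if doc_id in seen_ids:
                continue
            seen_ids.add(doc_id)
            terms = set(text.split())
            recent.append(len(terms - vocab))
            vocab |= terms
        if len(recent) >= window:
            if sum(recent[-window:]) / window < new_terms_threshold:
                break  # the sample has stopped revealing new vocabulary
    return seen_ids, vocab


if __name__ == "__main__":
    toy_docs = [
        "apple banana cherry",
        "banana cherry date",
        "cherry date egg",
        "apple egg fig",
        "fig grape apple",
        "grape apple banana",
    ]
    ids, vocab = probe_sample(make_search(toy_docs), seed_term="apple")
    print(len(ids), "documents sampled;", len(vocab), "terms seen")
```

Because the stopping condition depends on what each probe returns, the number of sampled documents can differ from one collection to another rather than being fixed in advance, which is the behaviour the abstract argues for.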

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shokouhi, M., Scholer, F., Zobel, J. (2006). Sample Sizes for Query Probing in Uncooperative Distributed Information Retrieval. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_7

  • DOI: https://doi.org/10.1007/11610113_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31142-3

  • Online ISBN: 978-3-540-32437-9

  • eBook Packages: Computer Science (R0)
