Distributed Information Retrieval

  • Chapter
Advances in Information Retrieval

Part of the book series: The Information Retrieval Series (INRE, volume 7)

Abstract

A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small number of databases, and merging the results returned by different databases. This paper presents algorithms for each task. It also discusses how to reorganize conventional test collections into multi-database testbeds and describes evaluation methodologies for multi-database experiments. A broad and diverse set of experimental results is presented to demonstrate that the algorithms are effective, efficient, robust, and scalable.
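
The four tasks named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's algorithms: the `Database` class, the document-frequency resource description, the ranking formula, and the max-score normalization used for merging are all simplifying assumptions chosen only to show how the steps fit together.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


class Database:
    """Hypothetical searchable text database with a toy resource description."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs
        # Step 1 (discover contents): record each term's document frequency.
        self.df = Counter()
        for doc in docs:
            self.df.update(set(tokenize(doc)))

    def search(self, query, n=10):
        """Return up to n (score, doc) pairs, scored by query-term frequency."""
        terms = tokenize(query)
        scored = []
        for doc in self.docs:
            tf = Counter(tokenize(doc))
            score = sum(tf[t] for t in terms)
            if score > 0:
                scored.append((score, doc))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return scored[:n]


def rank_databases(query, databases):
    """Step 2: order databases by a toy estimate of how well they match."""
    terms = tokenize(query)

    def score(db):
        return sum(db.df[t] for t in terms) / max(len(db.docs), 1)

    # Ties are broken by name so the ordering is deterministic.
    return sorted(databases, key=lambda db: (-score(db), db.name))


def federated_search(query, databases, k=2, n=10):
    """Steps 3 and 4: search the top-k databases, then merge their results."""
    merged = []
    for db in rank_databases(query, databases)[:k]:
        results = db.search(query, n)
        top = results[0][0] if results else 1
        # Divide by each database's best score so scores are comparable.
        for score, doc in results:
            merged.append((score / top, db.name, doc))
    merged.sort(key=lambda r: (-r[0], r[1], r[2]))
    return merged[:n]
```

Calling `federated_search("stock market", dbs, k=2)` on a list of `Database` objects searches only the two highest-ranked databases and returns `(normalized_score, database_name, document)` tuples in a single merged ranking.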

© 2002 Kluwer Academic Publishers

About this chapter

Cite this chapter

Callan, J. (2002). Distributed Information Retrieval. In: Croft, W.B. (eds) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_5

  • DOI: https://doi.org/10.1007/0-306-47019-5_5

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-7923-7812-9

  • Online ISBN: 978-0-306-47019-6

  • eBook Packages: Springer Book Archive
