Distributed Information Retrieval

Callan, Jamie

doi:10.1007/0-306-47019-5_5

Part of the book series: The Information Retrieval Series (INRE, volume 7)

Abstract

A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small number of databases, and merging results returned by different databases. This paper presents algorithms for each task. It also discusses how to reorganize conventional test collections into multi-database testbeds, and evaluation methodologies for multi-database experiments. A broad and diverse group of experimental results is presented to demonstrate that the algorithms are effective, efficient, robust, and scalable.
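The four tasks named in the abstract can be illustrated with a minimal sketch. It is not the chapter's algorithms: the document-frequency descriptions, the averaged-coverage database score, the term-count search, and the divide-by-maximum score normalization are simple stand-ins chosen for illustration, and every name below (`describe`, `rank_databases`, `search`, `merge`, `distributed_search`, `DATABASES`) is hypothetical. The sketch also assumes each database's full contents can be read when building its description, which a real multi-database environment may not allow.

```python
from collections import Counter

def describe(docs):
    """Task 1 (stand-in): summarize a database by the document frequency of each term."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return df

def rank_databases(query, descriptions, sizes):
    """Task 2 (stand-in): score a database by the average fraction of its
    documents that contain each query term, then sort best first."""
    terms = query.lower().split()
    if not terms:
        return list(descriptions)
    scores = {}
    for name, df in descriptions.items():
        n_docs = max(sizes[name], 1)  # avoid dividing by zero for empty databases
        scores[name] = sum(df[t] / n_docs for t in terms) / len(terms)
    return sorted(scores, key=scores.get, reverse=True)

def search(query, docs):
    """Task 3 (stand-in): score a document by its number of query-term occurrences."""
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in enumerate(docs):
        score = sum(1 for word in text.lower().split() if word in terms)
        if score > 0:
            hits.append((doc_id, score))
    return hits

def merge(results_by_db):
    """Task 4 (stand-in): divide each database's scores by its top score so the
    lists are comparable, then sort everything into a single ranking."""
    merged = []
    for name, hits in results_by_db.items():
        if not hits:
            continue
        top = max(score for _, score in hits)
        merged.extend((name, doc_id, score / top) for doc_id, score in hits)
    return sorted(merged, key=lambda hit: (-hit[2], hit[0], hit[1]))

def distributed_search(query, databases, k=2):
    """Describe every database, search only the k best-ranked ones, merge the results."""
    descriptions = {name: describe(docs) for name, docs in databases.items()}
    sizes = {name: len(docs) for name, docs in databases.items()}
    selected = rank_databases(query, descriptions, sizes)[:k]
    results = {name: search(query, databases[name]) for name in selected}
    return selected, merge(results)

# Made-up toy data for illustration only.
DATABASES = {
    "news": ["stock market falls", "market rally continues"],
    "sports": ["football match tonight", "tennis final result"],
    "finance": ["stock prices rise", "bond market update", "stock market outlook"],
}
```

Calling `distributed_search("stock market", DATABASES, k=2)` searches only the two best-ranked databases and returns a single merged list of `(database, document index, normalized score)` tuples. The point of the sketch is the division of labor: selection limits how many databases are contacted, and merging must reconcile scores that were computed independently.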

Author information

Authors and Affiliations

Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Jamie Callan

Editor information

Editors and Affiliations

University of Massachusetts, Amherst
W. Bruce Croft

About this chapter

Cite this chapter

Callan, J. (2002). Distributed Information Retrieval. In: Croft, W.B. (ed.) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_5

DOI: https://doi.org/10.1007/0-306-47019-5_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7812-9
Online ISBN: 978-0-306-47019-6
eBook Packages: Springer Book Archive
