Abstract
A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small number of databases, and merging results returned by different databases. This paper presents algorithms for each task. It also discusses how to reorganize conventional test collections into multi-database testbeds, and evaluation methodologies for multi-database experiments. A broad and diverse group of experimental results is presented to demonstrate that the algorithms are effective, efficient, robust, and scalable.
Copyright information
© 2002 Kluwer Academic Publishers
About this chapter
Cite this chapter
Callan, J. (2002). Distributed Information Retrieval. In: Croft, W.B. (eds) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_5
DOI: https://doi.org/10.1007/0-306-47019-5_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7812-9
Online ISBN: 978-0-306-47019-6