Abstract
Effective retrieval in a distributed environment is an important but difficult problem. Lack of effectiveness appears to have two major causes. First, existing collection selection algorithms do not work well on heterogeneous collections. Second, relevant documents are scattered over many collections, so searching only a few collections misses many relevant documents. We propose a topic-oriented approach to distributed retrieval. With this approach, we structure the document set of a distributed retrieval environment around a set of topics. Retrieval for a query involves first selecting the right topics for the query and then dispatching the search process to the collections that contain those topics. The content of a topic is characterized by a language model. In environments where the labeling of documents by topics is unavailable, document clustering is employed for topic identification. Based on these ideas, three methods are proposed to suit different environments. We show that all three methods improve the effectiveness of distributed retrieval.
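The two-step process described in the abstract (select topics for the query, then dispatch the search to the collections holding those topics) can be illustrated with a minimal sketch. The abstract does not give the scoring function, so this sketch assumes a unigram language model per topic with additive smoothing and ranks topics by query log-likelihood. All names (`unigram_model`, `select_topics`, `collections_for`, the toy topics and collection labels) are hypothetical and chosen only for illustration; they are not the chapter's implementation.

```python
import math
from collections import Counter


def unigram_model(docs):
    """Aggregate term counts over the tokenized documents of one topic."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts


def query_log_likelihood(query, counts, vocab_size, alpha=1.0):
    """Log-probability of the query under a topic's unigram model.

    Assumption: additive (Laplace-style) smoothing with parameter alpha,
    so unseen query terms do not zero out the score.
    """
    total = sum(counts.values())
    return sum(
        math.log((counts[term] + alpha) / (total + alpha * vocab_size))
        for term in query
    )


def select_topics(query, topic_models, vocab_size, k=1):
    """Return the k topic names whose models score the query highest."""
    ranked = sorted(
        topic_models,
        key=lambda name: query_log_likelihood(query, topic_models[name], vocab_size),
        reverse=True,
    )
    return ranked[:k]


def collections_for(topics, topic_locations):
    """List the collections that hold the selected topics, without repeats."""
    selected = []
    for topic in topics:
        for collection in topic_locations.get(topic, []):
            if collection not in selected:
                selected.append(collection)
    return selected


# Toy data (hypothetical): two topics and the collections that contain them.
topic_docs = {
    "sports": [["football", "goal", "match"], ["tennis", "match"]],
    "finance": [["stock", "market", "bond"], ["market", "rate"]],
}
topic_locations = {"sports": ["coll_a"], "finance": ["coll_b", "coll_c"]}

topic_models = {name: unigram_model(docs) for name, docs in topic_docs.items()}
vocab_size = len({t for docs in topic_docs.values() for d in docs for t in d})

query = ["market", "rate"]
best_topics = select_topics(query, topic_models, vocab_size, k=1)
targets = collections_for(best_topics, topic_locations)
```

In this toy setting the query is routed only to the collections that hold the best-matching topic. Where topic labels are unavailable, the `topic_docs` grouping would come from document clustering, as the abstract notes.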
Copyright information
© 2002 Kluwer Academic Publishers
About this chapter
Cite this chapter
Xu, J., Croft, W.B. (2002). Topic-Based Language Models for Distributed Retrieval. In: Croft, W.B. (eds) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_6
DOI: https://doi.org/10.1007/0-306-47019-5_6
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7812-9
Online ISBN: 978-0-306-47019-6
eBook Packages: Springer Book Archive