Abstract
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of text. Collection selection and partial collection replication with replica selection are two such mechanisms: they enable IR systems to search a small percentage of the data and thus improve performance and scalability. To maintain effectiveness as well as efficiency, IR systems must be configured carefully to account for workload locality and possible collection organizations. We propose IR system architectures that incorporate collection selection and partial replication, and compare configurations using a validated simulator. Locality and collection organization have dramatic effects on performance. For example, our simulation results show that collection selection performs especially well when queries are distributed uniformly across collections and collections are organized by topic, but it suffers when particular collections are “hot.” We find that when queries have even modest locality, configurations that replicate data outperform those that partition data, usually significantly. These results can serve as the basis for IR system designs under a variety of workloads and collection organizations.
Copyright information
© 2002 Kluwer Academic Publishers
About this chapter
Cite this chapter
Lu, Z., McKinley, K.S. (2002). The Effect of Collection Organization and Query Locality on Information Retrieval System Performance. In: Croft, W.B. (ed.) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_7
DOI: https://doi.org/10.1007/0-306-47019-5_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7812-9
Online ISBN: 978-0-306-47019-6
eBook Packages: Springer Book Archive