The Effect of Collection Organization and Query Locality on Information Retrieval System Performance

Lu, Zhihong; McKinley, Kathryn S.

doi:10.1007/0-306-47019-5_7

Chapter

Part of the book series: The Information Retrieval Series (INRE, volume 7)

Abstract

The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of text. Collection selection and partial collection replication with replica selection are two such mechanisms that enable IR systems to search a small percentage of data and thus improve performance and scalability. To maintain effectiveness as well as efficiency, IR systems must be configured carefully to consider workload locality and possible collection organizations. We propose IR system architectures that incorporate collection selection and partial replication, and compare configurations using a validated simulator. Locality and collection organization have dramatic effects on performance. For example, we demonstrate with simulation results that collection selection performs especially well when the distribution of queries to collections is uniform and collections are organized by topics, but it suffers when particular collections are “hot.” We find that when queries have even modest locality, configurations that replicate data outperform those that partition data, usually significantly. These results can be used as the basis for IR system designs under a variety of workloads and collection organizations.

Author information

Authors and Affiliations

AT&T Network Services, USA
Zhihong Lu
Department of Computer Science, University of Massachusetts, USA
Kathryn S. McKinley

Editor information

Editors and Affiliations

University of Massachusetts, Amherst
W. Bruce Croft

About this chapter

Cite this chapter

Lu, Z., McKinley, K.S. (2002). The Effect of Collection Organization and Query Locality on Information Retrieval System Performance. In: Croft, W.B. (ed.) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_7

DOI: https://doi.org/10.1007/0-306-47019-5_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7812-9
Online ISBN: 978-0-306-47019-6
eBook Packages: Springer Book Archive