The Effect of Collection Organization and Query Locality on Information Retrieval System Performance

Part of the book series: The Information Retrieval Series (INRE, volume 7)

Abstract

The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of text. Collection selection and partial collection replication with replica selection are two such mechanisms that enable IR systems to search a small percentage of data and thus improve performance and scalability. To maintain effectiveness as well as efficiency, IR systems must be configured carefully to consider workload locality and possible collection organizations. We propose IR system architectures that incorporate collection selection and partial replication, and compare configurations using a validated simulator. Locality and collection organization have dramatic effects on performance. For example, we demonstrate with simulation results that collection selection performs especially well when the distribution of queries to collections is uniform and collections are organized by topics, but it suffers when particular collections are “hot.” We find that when queries have even modest locality, configurations that replicate data outperform those that partition data, usually significantly. These results can be used as the basis for IR system designs under a variety of workloads and collection organizations.
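The effect of "hot" collections described above can be illustrated with a toy load calculation. This is a minimal sketch, not the validated simulator used in the chapter: the function name `busiest_server_share`, the server and collection names, the round-robin replica choice, and the 12-query workloads are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def busiest_server_share(query_targets, placement):
    """Return the fraction of all queries handled by the most loaded server.

    query_targets: list of collection names, one entry per query.
    placement: dict mapping a collection name to the list of servers
               that hold a copy of it.
    Assumption: queries to a collection are spread round-robin over its copies.
    """
    load = Counter()   # queries handled per server
    seen = Counter()   # queries seen so far per collection
    for coll in query_targets:
        servers = placement[coll]
        server = servers[seen[coll] % len(servers)]
        seen[coll] += 1
        load[server] += 1
    return max(load.values()) / len(query_targets)

# Hypothetical workloads over four topic collections, 12 queries each.
uniform = ["A", "B", "C", "D"] * 3          # every collection equally popular
hot = ["A"] * 9 + ["B", "C", "D"]           # collection A is "hot"

# Partitioning: each collection lives on exactly one server.
partitioned = {"A": ["s1"], "B": ["s2"], "C": ["s3"], "D": ["s4"]}
# Replication: the hot collection is copied to every server.
replicated = {"A": ["s1", "s2", "s3", "s4"], "B": ["s2"], "C": ["s3"], "D": ["s4"]}

print(busiest_server_share(uniform, partitioned))  # 0.25
print(busiest_server_share(hot, partitioned))      # 0.75
print(busiest_server_share(hot, replicated))       # 0.25
```

In this toy setting the partitioned layout is balanced under the uniform workload but sends three quarters of the queries to one server when collection A is hot, while replicating A restores an even split. The sketch models only load balance; it ignores retrieval effectiveness, replica selection by relevance, and network costs, which the chapter's simulator accounts for.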

Copyright information

© 2002 Kluwer Academic Publishers

About this chapter

Cite this chapter

Lu, Z., McKinley, K.S. (2002). The Effect of Collection Organization and Query Locality on Information Retrieval System Performance. In: Croft, W.B. (ed.) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_7

  • DOI: https://doi.org/10.1007/0-306-47019-5_7

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-7923-7812-9

  • Online ISBN: 978-0-306-47019-6

  • eBook Packages: Springer Book Archive
