Scalable text retrieval for large digital libraries

  • Information Retrieval I
  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1324)

Abstract

It is argued that digital libraries of the future will contain terabyte-scale collections of digital text and that full-text searching techniques will be required to operate over collections of this magnitude. Algorithms expected to be capable of scaling to these data sizes using clusters of modern workstations are described. First, basic indexing and retrieval algorithms are presented that operate at performance levels comparable to other leading systems over gigabytes of text on a single workstation. Next, simple mechanisms are presented for extending query-processing capacity to much greater collection sizes: to tens of gigabytes for single workstations and to terabytes for clusters of such workstations. Query-processing efficiency on a single workstation is shown to deteriorate dramatically when data size is increased above a certain multiple of physical memory size. By contrast, the number of clustered workstations necessary to maintain a constant level of service increases linearly with increasing data size. Experiments using clusters of up to 16 workstations are reported. A non-replicated 20-gigabyte collection was indexed in just over 5 hours using a ten-workstation cluster, and scalability results are presented for query processing over replicated collections of up to 102 gigabytes.
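The linear-scaling claim above can be illustrated with a small sketch. The abstract does not say how the collection is divided among workstations, so the sketch assumes a document-partitioned layout in which each node ranks its own documents and a coordinator merges the per-node results. The function names, the toy term-frequency scoring, and the figure of 2 gigabytes per workstation (20 gigabytes spread evenly over ten workstations) are illustrative assumptions, not the paper's method.

```python
import heapq
import math
from collections import Counter


def workstations_needed(collection_gb: float, gb_per_workstation: float) -> int:
    """Linear scaling: node count grows in proportion to collection size.

    gb_per_workstation is an assumed per-node capacity, not a measured value.
    """
    return math.ceil(collection_gb / gb_per_workstation)


def score_partition(docs: dict[str, str], query: list[str]) -> list[tuple[int, str]]:
    """Score each local document by raw query-term frequency (toy ranking)."""
    results = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query)
        if score > 0:
            results.append((score, doc_id))
    return results


def search(partitions: list[dict[str, str]], query: list[str], k: int) -> list[tuple[int, str]]:
    """Hypothetical coordinator: take each node's local top-k, then merge."""
    per_node = [heapq.nlargest(k, score_partition(p, query)) for p in partitions]
    return heapq.nlargest(k, (hit for hits in per_node for hit in hits))


if __name__ == "__main__":
    # Assumed capacity: 20 GB spread evenly over ten workstations.
    print(workstations_needed(20, 2.0))
    partitions = [
        {"a": "text retrieval text"},
        {"b": "digital library"},
        {"c": "text"},
    ]
    print(search(partitions, ["text"], k=2))
```

Because each node returns only its local top k hits, the merge step handles at most k results per node, which is one reason a partitioned design of this kind can keep per-query cost roughly constant as nodes are added along with data.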

The author wishes to acknowledge that this work was carried out within the Cooperative Research Centre for Advanced Computational Systems established under the Australian Government's Cooperative Research Centres Program.

Editor information

Carol Peters, Costantino Thanos

Copyright information

© 1997 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hawking, D. (1997). Scalable text retrieval for large digital libraries. In: Peters, C., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 1997. Lecture Notes in Computer Science, vol 1324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026726

  • DOI: https://doi.org/10.1007/BFb0026726

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63554-3

  • Online ISBN: 978-3-540-69597-4

  • eBook Packages: Springer Book Archive
