Scalable text retrieval for large digital libraries

Hawking, David

doi:10.1007/BFb0026726

Information Retrieval I
Conference paper
First Online: 01 January 2005

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1324))

Abstract

It is argued that digital libraries of the future will contain terabyte-scale collections of digital text and that full-text searching techniques will be required to operate over collections of this magnitude. Algorithms expected to be capable of scaling to these data sizes using clusters of modern workstations are described. First, basic indexing and retrieval algorithms are presented that operate at performance levels comparable to other leading systems over gigabytes of text on a single workstation. Next, simple mechanisms are presented for extending query-processing capacity to much larger collections: tens of gigabytes for single workstations and terabytes for clusters of such workstations. Query-processing efficiency on a single workstation is shown to deteriorate dramatically when data size is increased above a certain multiple of physical memory size. By contrast, the number of clustered workstations necessary to maintain a constant level of service increases linearly with increasing data size. Experiments using clusters of up to 16 workstations are reported. A non-replicated 20-gigabyte collection was indexed in just over 5 hours using a ten-workstation cluster, and scalability results are presented for query processing over replicated collections of up to 102 gigabytes.
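
As a rough illustration of the figures quoted in the abstract, the following Python sketch works through the indexing arithmetic (20 gigabytes in roughly 5 hours on ten workstations) and the stated linear relationship between collection size and cluster size. The function names and the per-workstation capacity of 8 gigabytes are assumptions made for illustration and do not come from the paper. "Just over 5 hours" is rounded down to 5, so the throughput shown is a slight overestimate.

```python
import math


def aggregate_throughput_gb_per_hour(collection_gb: float, hours: float) -> float:
    """Cluster-wide indexing rate in gigabytes per hour."""
    return collection_gb / hours


def workstations_needed(collection_gb: float, per_node_capacity_gb: float) -> int:
    """Workstations required if each one serves a fixed share of the data.

    per_node_capacity_gb is a hypothetical parameter: the amount of text one
    workstation can serve at the target level of service.
    """
    return math.ceil(collection_gb / per_node_capacity_gb)


if __name__ == "__main__":
    # Figures from the abstract: 20 GB, about 5 hours, ten workstations.
    total = aggregate_throughput_gb_per_hour(20, 5)
    per_node = total / 10
    print(f"aggregate: {total:.1f} GB/h, per workstation: {per_node:.1f} GB/h")

    # Assumed capacity of 8 GB per workstation (illustrative only).
    for size_gb in (102, 204):
        print(size_gb, "GB ->", workstations_needed(size_gb, 8), "workstations")
```

Under the assumed capacity, doubling the collection from 102 to 204 gigabytes doubles the estimated cluster size, which is the linear growth the abstract describes for clustered workstations.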

The author wishes to acknowledge that this work was carried out within the Cooperative Research Centre For Advanced Computational Systems established under the Australian Government's Cooperative Research Centres Program.

Author information

Authors and Affiliations

Co-operative Research Centre For Advanced Computational Systems, Department Of Computer Science, Australian National University, Australia
David Hawking

Editor information

Carol Peters, Costantino Thanos

About this paper

Cite this paper

Hawking, D. (1997). Scalable text retrieval for large digital libraries. In: Peters, C., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 1997. Lecture Notes in Computer Science, vol 1324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026726

DOI: https://doi.org/10.1007/BFb0026726
Published: 17 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63554-3
Online ISBN: 978-3-540-69597-4
eBook Packages: Springer Book Archive
