Abstract
It is argued that digital libraries of the future will contain terabyte-scale collections of digital text and that full-text searching techniques will be required to operate over collections of this magnitude. Algorithms expected to be capable of scaling to these data sizes using clusters of modern workstations are described. First, basic indexing and retrieval algorithms are presented that operate over gigabytes of text on a single workstation at performance levels comparable to other leading systems. Next, simple mechanisms are presented for extending query processing capacity to much greater collection sizes: to tens of gigabytes for single workstations and to terabytes for clusters of such workstations. Query-processing efficiency on a single workstation is shown to deteriorate dramatically when data size is increased above a certain multiple of physical memory size. By contrast, the number of clustered workstations necessary to maintain a constant level of service increases linearly with increasing data size. Experiments using clusters of up to 16 workstations are reported. A non-replicated 20-gigabyte collection was indexed in just over 5 hours using a ten-workstation cluster, and scalability results are presented for query processing over replicated collections of up to 102 gigabytes.
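The arithmetic behind the abstract's scaling claims can be sketched as follows. This is a minimal illustration, not the paper's method: the function names and the `gb_per_workstation` parameter are hypothetical, and the 10-gigabyte per-workstation share used in the example calls is an assumed figure chosen only to show the linear relationship. The indexing figures (20 gigabytes, ten workstations) come from the abstract; "just over 5 hours" is rounded to 5, so the computed rates are slight overestimates.

```python
import math


def workstations_needed(data_gb: float, gb_per_workstation: float) -> int:
    """Cluster size under linear scaling.

    Assumes (hypothetically) that each workstation can serve a fixed
    share of the collection, gb_per_workstation, at the target level
    of service, so the node count grows in proportion to data size.
    """
    if data_gb <= 0 or gb_per_workstation <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(data_gb / gb_per_workstation)


def indexing_rate_gb_per_hour(data_gb: float, hours: float, workstations: int):
    """Return (aggregate rate, per-workstation rate) in gigabytes per hour."""
    if hours <= 0 or workstations <= 0:
        raise ValueError("hours and workstations must be positive")
    aggregate = data_gb / hours
    return aggregate, aggregate / workstations


# Assumed share of 10 GB per workstation (illustrative only).
print(workstations_needed(102, 10))    # 102 GB collection
print(workstations_needed(1000, 10))   # roughly a terabyte

# Reported indexing run: 20 GB on ten workstations, time rounded to 5 hours.
print(indexing_rate_gb_per_hour(20, 5, 10))
```

Under these assumptions, doubling the collection size doubles the required number of workstations, which is the linear behaviour the abstract contrasts with the sharp degradation seen on a single workstation once data size exceeds a multiple of physical memory.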
The author wishes to acknowledge that this work was carried out within the Cooperative Research Centre for Advanced Computational Systems, established under the Australian Government's Cooperative Research Centres Program.
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hawking, D. (1997). Scalable text retrieval for large digital libraries. In: Peters, C., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 1997. Lecture Notes in Computer Science, vol 1324. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026726
DOI: https://doi.org/10.1007/BFb0026726
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63554-3
Online ISBN: 978-3-540-69597-4
eBook Packages: Springer Book Archive