Harvesting for Full-Text Retrieval

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3815)

Abstract

We propose an approach to Distributed Information Retrieval based on the periodic and incremental centralisation of full-text indices of widely dispersed and autonomously managed content sources.

Inspired by the success of the Open Archives Initiative’s protocol for metadata harvesting, the approach occupies the middle ground between (i) the crawling of content and (ii) the distribution of retrieval. As in crawling, some data moves towards the retrieval process, but it is statistics about the content rather than the content itself. As in distributed retrieval, some processing is distributed along with the data, but it is indexing rather than retrieval itself. We show that the approach retains the good properties of centralised retrieval without giving up cost-effective resource pooling. We discuss the requirements associated with the approach and identify two strategies for deploying it on top of the OAI infrastructure.
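
The idea of harvesting index data rather than content can be illustrated with a minimal sketch. It is not the paper's implementation: the class names (`Source`, `CentralIndex`), the integer datestamps, the whitespace tokeniser, and the simple tf-idf scoring are all assumptions made for illustration. Each source indexes its own documents and exposes only term statistics; the central service periodically pulls the records changed since its last visit and merges them into one index.

```python
import math
from collections import Counter, defaultdict


class Source:
    """Hypothetical content source that indexes its own documents locally."""

    def __init__(self, name):
        self.name = name
        self.records = {}  # doc_id -> (datestamp, term frequencies)

    def add(self, doc_id, text, datestamp):
        # Indexing happens at the source; only term statistics are kept.
        self.records[doc_id] = (datestamp, Counter(text.lower().split()))

    def harvest(self, since=None):
        # Return the index records modified after `since` (all if None).
        return {d: (ts, tf) for d, (ts, tf) in self.records.items()
                if since is None or ts > since}


class CentralIndex:
    """Hypothetical central service that merges harvested statistics."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {(source, doc_id): tf}
        self.doc_terms = {}                # (source, doc_id) -> set of terms
        self.last_harvest = {}             # source name -> latest datestamp

    def pull(self, source):
        # Incremental harvest: ask only for records newer than the last visit.
        batch = source.harvest(self.last_harvest.get(source.name))
        for doc_id, (ts, tf) in batch.items():
            key = (source.name, doc_id)
            # Drop stale postings if the document was harvested before.
            for term in self.doc_terms.get(key, ()):
                del self.postings[term][key]
            for term, freq in tf.items():
                self.postings[term][key] = freq
            self.doc_terms[key] = set(tf)
            prev = self.last_harvest.get(source.name)
            self.last_harvest[source.name] = ts if prev is None else max(prev, ts)
        return len(batch)

    def search(self, query):
        # Retrieval is centralised: one ranking over global statistics.
        n = len(self.doc_terms)
        scores = Counter()
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + n / len(docs))
            for key, freq in docs.items():
                scores[key] += freq * idf
        return [key for key, _ in scores.most_common()]
```

In this sketch, document frequencies are computed over all sources at once, which is the property of centralised retrieval the abstract refers to, while the cost of indexing stays with each source.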



Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Simeoni, F., Yakici, M., Neely, S., Crestani, F. (2005). Harvesting for Full-Text Retrieval. In: Fox, E.A., Neuhold, E.J., Premsmit, P., Wuwongse, V. (eds) Digital Libraries: Implementing Strategies and Sharing Experiences. ICADL 2005. Lecture Notes in Computer Science, vol 3815. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11599517_24

  • DOI: https://doi.org/10.1007/11599517_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30850-8

  • Online ISBN: 978-3-540-32291-7

  • eBook Packages: Computer Science, Computer Science (R0)
