Abstract
Metadata harvesting has become a common technique for transferring a stream of data from one metadata repository or digital library system to another. As collections of metadata and their associated digital objects grow in size, ingesting these items at the destination archive can take a significant amount of time, depending on the type of indexing or post-processing required. This paper discusses an approach to parallelising the post-processing of data in a small cluster of machines or a multi-processor environment without increasing the burden on the source data provider. Performance tests carried out on varying architectures indicate that this technique is promising for some scenarios and can be extended to more computationally intensive ingest procedures. In general, the technique presents a new approach to constructing harvest-based distributed or component-based digital libraries with improved scalability.
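The abstract does not spell out the mechanism, so the following is only a minimal sketch of the general idea it describes: a single sequential harvest stream (so the source provider sees one client) whose records are fanned out to a pool of workers for post-processing. Every name here (`harvest`, `post_process`, `ingest`, the sample identifiers) is hypothetical, the harvest is simulated without any network access, and a thread pool stands in for the cluster nodes or processors discussed in the paper.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def harvest(records):
    """Hypothetical stand-in for one sequential harvest from a single source."""
    for record in records:
        yield record


def post_process(record):
    """Hypothetical ingest step (e.g. indexing); here it just hashes the title."""
    identifier, title = record
    digest = hashlib.sha256(title.encode("utf-8")).hexdigest()
    return identifier, digest


def ingest(records, workers=4):
    """Harvest once, then spread post-processing across a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() consumes the single harvest stream and returns results in order.
        return dict(pool.map(post_process, harvest(records)))


# Assumed sample data, not taken from the paper.
sample = [
    ("oai:example:1", "Parallelising Harvesting"),
    ("oai:example:2", "A Scalability Survey in IR and DL"),
]
index = ingest(sample)
```

In this sketch the number of workers changes only how the destination processes records; the source is still read once, which is the property the abstract emphasises. The pool could be swapped for a process pool or a set of cluster nodes when the post-processing is CPU-bound.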
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Suleman, H. (2006). Parallelising Harvesting. In: Sugimoto, S., Hunter, J., Rauber, A., Morishima, A. (eds) Digital Libraries: Achievements, Challenges and Opportunities. ICADL 2006. Lecture Notes in Computer Science, vol 4312. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11931584_11
DOI: https://doi.org/10.1007/11931584_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49375-4
Online ISBN: 978-3-540-49377-8
eBook Packages: Computer Science (R0)