InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives

  • Conference paper
Research and Advanced Technology for Digital Libraries (TPDL 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Abstract

We have integrated Web ARChive (WARC) files with the peer-to-peer, content-addressable InterPlanetary File System (IPFS) to allow the payload content of web archives to be easily propagated. We also provide an archival replay system, extended from pywb, that fetches the WARC content from IPFS and re-assembles the originally archived HTTP responses for replay. Using a 1.0 GB sample Archive-It collection of WARCs containing 21,994 mementos, we show that extracting the HTTP response content of the WARCs and indexing it with IPFS lookup hashes takes 66.6 min, inclusive of dissemination into IPFS.
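The store-then-reassemble idea in the abstract can be illustrated with a minimal sketch. This is not the ipwb implementation and it does not talk to IPFS: the `ContentStore` class is a hypothetical in-memory stand-in for a content-addressable store, plain SHA-256 hex digests stand in for IPFS hashes, and the function names and the shape of the index record are assumptions made for illustration only.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory stand-in for a content-addressable store.

    Keys are SHA-256 hex digests of the stored bytes. Real IPFS identifiers
    use a different encoding; this only illustrates content addressing.
    """

    def __init__(self):
        self._blocks = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blocks[key] = data  # identical content always maps to the same key
        return key

    def get(self, key: str) -> bytes:
        return self._blocks[key]


def split_http_response(raw: bytes):
    """Split a raw HTTP response into (header block, payload) at the first blank line."""
    header, sep, payload = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/payload separator found")
    return header + sep, payload


def index_response(store: ContentStore, uri: str, timestamp: str, raw: bytes) -> dict:
    """Store header and payload separately and return an index record with both keys."""
    header, payload = split_http_response(raw)
    return {
        "uri": uri,
        "timestamp": timestamp,
        "header_key": store.add(header),
        "payload_key": store.add(payload),
    }


def replay(store: ContentStore, record: dict) -> bytes:
    """Re-assemble the archived HTTP response from the keys held in an index record."""
    return store.get(record["header_key"]) + store.get(record["payload_key"])


if __name__ == "__main__":
    store = ContentStore()
    raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
    entry = index_response(store, "http://example.com/", "20160101000000", raw)
    print(replay(store, entry) == raw)
```

Storing the header block and the payload under separate keys is a design choice of this sketch: two captures with identical payloads but different response headers would then share one payload entry in the store.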

Notes

  1. https://github.com/oduwsdl/ipwb

  2. https://github.com/iipc/openwayback

  3. https://github.com/ikreymer/pywb

  4. https://archive-it.org/collections/2438

  5. https://github.com/ipfs/go-ipfs/issues/1216

  6. http://archivesunleashed.ca

Acknowledgements

We would like to thank Ilya Kreymer for his feedback during the development of the ipwb prototype and guidance in interfacing with the pywb replay system. This work was supported in part by NSF award 1624067 via the Archives Unleashed Hackathon (Note 6), where we developed the prototype.

Author information

Corresponding author

Correspondence to Mat Kelly.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kelly, M., Alam, S., Nelson, M.L., Weigle, M.C. (2016). InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_35

  • DOI: https://doi.org/10.1007/978-3-319-43997-6_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43996-9

  • Online ISBN: 978-3-319-43997-6

  • eBook Packages: Computer Science, Computer Science (R0)
