Abstract
As of today, ‘web.archive.org’ has archived more than 338 billion web pages. How many of those pages are 100% retrievable? How many pages were left out or ignored just because they had some compatibility issue? How many of them were in vernacular languages and encoded in different formats (before Unicode was standardized)? And that is only the text content type; consider the other MIME types, which are encoded and decoded with different algorithms. The underlying reason lies in the fundamental representation of digital data: a sequence of 0s and 1s makes no sense unless it is decoded properly. The browsers that could have rendered the content properly at the time of archiving might since have become obsolete, or the browsers and their platforms might have been upgraded well beyond the point of recognizing old formats. We studied various works on data preservation and web archiving and propose a new framework that stores the exact client browser details (user-agent) in the WARC record and uses them to load the corresponding browser at the client side and render the archived content.
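To make the idea concrete, the following is a minimal sketch, not the authors' implementation, of how a capture tool might write the client user-agent into a WARC record header and how a replay tool might read it back. The field name `WARC-Client-User-Agent` and the helpers `build_record` and `read_user_agent` are hypothetical and used only for illustration; WARC 1.1 does not define such a field, and the abstract does not say which field the framework uses.

```python
from datetime import datetime, timezone
from typing import Optional
import uuid

# Hypothetical extension field name (an assumption for this sketch);
# it is NOT defined by the WARC 1.1 specification.
UA_FIELD = "WARC-Client-User-Agent"


def build_record(target_uri: str, payload: bytes, user_agent: str) -> bytes:
    """Serialize one WARC 'resource' record that also carries the user-agent."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        (UA_FIELD, user_agent),  # browser details captured at archiving time
        ("Content-Type", "text/html"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record is: header block, blank line, content block, two CRLFs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def read_user_agent(record: bytes) -> Optional[str]:
    """Return the stored user-agent, or None if the record has none."""
    head, _, _ = record.partition(b"\r\n\r\n")
    # Skip the version line, then scan the named fields.
    for line in head.decode("utf-8").split("\r\n")[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == UA_FIELD.lower():
            return value.strip()
    return None
```

At replay time, the value returned by `read_user_agent` could be used to select a matching browser build on the client side; how that mapping is done is outside the scope of this sketch.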
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Devendran, A., Arunkumar, K. (2020). A Framework for Web Archiving and Guaranteed Retrieval. In: Sharma, N., Chakrabarti, A., Balas, V. (eds) Data Management, Analytics and Innovation. Advances in Intelligent Systems and Computing, vol 1016. Springer, Singapore. https://doi.org/10.1007/978-981-13-9364-8_16
DOI: https://doi.org/10.1007/978-981-13-9364-8_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9363-1
Online ISBN: 978-981-13-9364-8
eBook Packages: Engineering, Engineering (R0)