Abstract
To date, most digital preservation work has focused on copying the resources to be preserved from the “living web” and placing them in an archive for controlled curation. Once inside an archive, the resources are subject to careful processes of refreshing (making additional copies on new media) and migrating (converting to new formats and applications). For small numbers of resources of known value, this is a practical and worthwhile approach to digital preservation. However, because of the infrastructure costs (storage, networks, machines) and, more importantly, the human management costs, this approach is unsuitable for web-scale preservation. The result is that difficult decisions must be made about what is saved and what is not. We provide an overview of our ongoing research projects that use the “web infrastructure” to provide preservation capabilities for web pages, and we examine the overlap these approaches have with the field of information retrieval. The common characteristic of the projects is that they creatively employ the web infrastructure to provide shallow but broad preservation capability for all web pages. These approaches are not intended to replace conventional archiving approaches; rather, they aim to provide at least some form of archival capability for the mass of web pages that may prove to have value in the future. We characterize the preservation approaches by the level of effort required of the web administrator: web sites are reconstructed from the caches of search engines (“lazy preservation”); lexical signatures are used to find the same or similar pages elsewhere on the web (“just-in-time preservation”); resources are pushed to other sites using NNTP newsgroups and SMTP email attachments (“shared infrastructure preservation”); and an Apache module is used to provide OAI-PMH access to MPEG-21 DIDL representations of web pages (“web server enhanced preservation”).
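As a rough illustration of the “just-in-time preservation” idea, the sketch below derives a lexical signature for a page, assuming the signature is simply the few terms with the highest TF-IDF weight relative to a reference corpus. The function names, the tokenizer, the smoothed IDF formula, and the default of five terms are all assumptions made for this example; they are not taken from the article.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF weight.

    `corpus` is a list of reference documents used only to estimate
    how common each term is (hypothetical stand-in for web-scale
    document frequencies).
    """
    tf = Counter(tokenize(page))
    n_docs = len(corpus)
    doc_tokens = [set(tokenize(doc)) for doc in corpus]

    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for tokens in doc_tokens if term in tokens)
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = count * idf

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

In such a scheme, the resulting terms would be submitted as a search engine query in the hope of locating the same or a similar page at a new location.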
Cite this article
Nelson, M.L., McCown, F., Smith, J.A. et al. Using the web infrastructure to preserve web pages. Int J Digit Libr 6, 327–349 (2007). https://doi.org/10.1007/s00799-007-0012-y
DOI: https://doi.org/10.1007/s00799-007-0012-y