Abstract
Web archives and Digital Libraries are conceptually similar, as both store and provide access to digital content. Loading documents into a Digital Library usually requires substantial intervention from human experts. However, large collections of documents gathered from the web must be loaded without human intervention. This paper analyzes strategies for selecting content for a national web archive and proposes a system architecture to support it.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gomes, D., Freitas, S., Silva, M.J. (2006). Design and Selection Criteria for a National Web Archive. In: Gonzalo, J., Thanos, C., Verdejo, M.F., Carrasco, R.C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2006. Lecture Notes in Computer Science, vol 4172. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11863878_17
DOI: https://doi.org/10.1007/11863878_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44636-1
Online ISBN: 978-3-540-44638-5
eBook Packages: Computer Science (R0)