Design and Selection Criteria for a National Web Archive

  • Conference paper
Research and Advanced Technology for Digital Libraries (ECDL 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4172)

Abstract

Web archives and digital libraries are conceptually similar, as both store and provide access to digital content. Loading documents into a digital library usually requires substantial intervention from human experts. However, large collections of documents gathered from the web must be loaded without human intervention. This paper analyzes strategies for selecting content for a national web archive and proposes a system architecture to support it.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gomes, D., Freitas, S., Silva, M.J. (2006). Design and Selection Criteria for a National Web Archive. In: Gonzalo, J., Thanos, C., Verdejo, M.F., Carrasco, R.C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2006. Lecture Notes in Computer Science, vol 4172. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11863878_17

  • DOI: https://doi.org/10.1007/11863878_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44636-1

  • Online ISBN: 978-3-540-44638-5

  • eBook Packages: Computer Science, Computer Science (R0)
